GitHub hosts over 35 million repositories and over 30 million Gists on hundreds of servers. Over the past year, we’ve built DGit, a new distributed storage system that dramatically improves the availability, reliability, and performance of serving and storing Git content.
DGit is short for “Distributed Git.” As many readers already know, Git itself is distributed—any copy of a Git repository contains every file, branch, and commit in the project’s entire history. DGit uses this property of Git to keep three copies of every repository, on three different servers. The design of DGit keeps repositories fully available without interruption even if one of those servers goes down. Even in the extreme case that two copies of a repository become unavailable at the same time, the repository remains readable; i.e., fetches, clones, and most of the web UI continue to work.
DGit performs replication at the application layer, rather than at the disk layer. Think of the replicas as three loosely-coupled Git repositories kept in sync via Git protocols, rather than identical disk images full of repositories. This design gives us great flexibility to decide where to store the replicas of a repository and which replica to use for read operations.
If a file server needs to be taken offline, DGit automatically determines which repositories are left with fewer than three replicas and creates new replicas of those repositories on other file servers. This “healing” process uses all remaining servers as both sources and destinations. Since healing throughput is N-by-N, it is quite fast. And all this happens without any downtime.
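As a rough illustration of the healing step described above, here is a minimal sketch in Python. It is not DGit's implementation: the function name `plan_healing`, the `placement` mapping, the server names, and the least-busy selection rule are all assumptions made for the example. It only shows the idea of finding repositories with fewer than three live replicas and spreading the copy work across the remaining servers as both sources and destinations.

```python
from collections import defaultdict

REPLICA_TARGET = 3  # the post states three copies of every repository


def plan_healing(placement, offline, servers):
    """Return (repo, source, destination) copy tasks for under-replicated repos.

    placement: dict of repo name -> set of servers holding a replica
    offline:   set of servers being taken out of service
    servers:   list of every server in the pool
    """
    healthy = [s for s in servers if s not in offline]
    assigned = defaultdict(int)  # copy tasks per server, as source or destination
    tasks = []
    for repo in sorted(placement):
        sources = sorted(set(placement[repo]) - set(offline))
        if not sources:
            continue  # no surviving replica to copy from; not handled in this sketch
        holders = set(sources)
        while len(holders) < REPLICA_TARGET:
            candidates = [s for s in healthy if s not in holders]
            if not candidates:
                break  # pool too small to restore the target
            # Spread work: least-busy surviving replica feeds least-busy new host.
            source = min(sources, key=lambda s: assigned[s])
            dest = min(candidates, key=lambda s: assigned[s])
            tasks.append((repo, source, dest))
            assigned[source] += 1
            assigned[dest] += 1
            holders.add(dest)
    return tasks


# Hypothetical five-server pool; "fs1" is being taken offline.
servers = ["fs1", "fs2", "fs3", "fs4", "fs5"]
placement = {
    "a": {"fs1", "fs2", "fs3"},
    "b": {"fs1", "fs4", "fs5"},
    "c": {"fs2", "fs3", "fs4"},
}
tasks = plan_healing(placement, offline={"fs1"}, servers=servers)
```

In this toy pool, repositories "a" and "b" each lose one replica, and the two copy tasks use different source and destination servers, which is the property that lets healing throughput grow with the size of the pool.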
Most end users store their Git repositories as objects, pack files, and references in a single .git directory. They access the repository using the Git command-line client or using graphical clients like GitHub Desktop or the built-in support for Git in IDEs like Visual Studio. Perhaps it’s surprising that GitHub’s repository-storage tier, DGit, is built using the same technologies. Why not a SAN? A distributed file system? Some other magical cloud technology that abstracts away the problem of storing bits durably?
The answer is simple: it’s fast and it’s robust.
Git is very sensitive to latency. A simple git log or git blame might require thousands of Git objects to be loaded and traversed sequentially. If there’s any latency in these low-level disk accesses, performance suffers dramatically. Thus, storing the repository in a distributed file system is not viable. Git is optimized to be fast when accessing fast disks, so the DGit file servers store repositories on fast, local SSDs.
At a higher level, Git is also optimized to exchange updates between Git repositories (e.g., pushes and fetches) over efficient protocols. So we use these protocols to keep the DGit replicas in sync.
Git is a mature and well-tested technology. Why reinvent the wheel when there is a Formula One racing car already available?
It has always been GitHub’s philosophy to use Git on our servers in a manner that is as close as possible to how Git is used by our users. DGit continues this tradition. If we find performance bottlenecks or other problems, we have several core Git and libgit2 contributors on staff who fix the problems and contribute the fixes to the open-source project for everybody to use. Our level of experience and expertise with Git helped make it the obvious choice to use for DGit replication.
Until recently, we kept copies of repository data using off-the-shelf, disk-layer replication technologies—namely, RAID and DRBD. We organized our file servers in pairs. Each active file server had a dedicated, online spare connected by a cross-over cable. Each disk had four copies: two copies on the main file server, using RAID, and another two copies on that file server’s hot spare, using DRBD. If anything went wrong with a file server—e.g., hardware failure, software crash, or an overload situation—a human would confirm the fault and order the spare to take over. Thus, there was a good level of redundancy, but the failover process required manual intervention and inevitably caused a little bit of downtime for the repositories on the failed server. To make such incidents as rare as possible, we have always stored repositories on specialized, highly reliable servers.
Now with DGit, each repository is stored on three servers chosen independently in locations distributed around our large pool of file servers. DGit automatically selects the servers to host each repository, keeps those replicas in sync, and picks the best server to handle each incoming read request. Writes are synchronously streamed to all three replicas and are only committed if at least two replicas confirm success.
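The write rule above (send to all three replicas, commit only if at least two confirm) can be sketched as follows. This is an illustrative sketch, not DGit's protocol: the post does not describe the wire protocol, so the prepare/commit/abort split, the `InMemoryReplica` class, and the `replicated_write` function are all assumptions introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor

QUORUM = 2  # from the post: at least two of three replicas must confirm


class InMemoryReplica:
    """Hypothetical stand-in for a file server holding one replica."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.refs = {}       # committed ref name -> object id
        self.pending = None  # update received but not yet committed

    def prepare(self, update):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        self.pending = dict(update)
        return True

    def commit(self, update):
        self.refs.update(self.pending)
        self.pending = None

    def abort(self, update):
        self.pending = None


def replicated_write(replicas, update):
    """Send `update` to every replica; commit only if a quorum accepts it."""

    def try_prepare(replica):
        try:
            return replica.prepare(update)
        except Exception:
            return False  # treat any failure as a "no" vote

    # Contact all replicas concurrently; map() keeps results in replica order.
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        votes = list(pool.map(try_prepare, replicas))

    accepted = [r for r, ok in zip(replicas, votes) if ok]
    if len(accepted) >= QUORUM:
        for r in accepted:
            r.commit(update)
        return True
    for r in accepted:
        r.abort(update)
    return False


replicas = [
    InMemoryReplica("r1"),
    InMemoryReplica("r2"),
    InMemoryReplica("r3", healthy=False),
]
ok = replicated_write(replicas, {"refs/heads/main": "abc123"})
```

With one replica down the write still succeeds on the other two. With two replicas down it is rejected and nothing is committed, which matches the post's statement that a repository with a single surviving copy remains readable rather than writable.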
GitHub now stores repositories in a cluster whose name includes “dfs,” short for “DGit file server.” The repositories are stored on local disks on these file servers and are served up by Git and libgit2. The clients of this cluster include the web front end and the proxies that speak to users’ Git clients.
DGit delivers many advantages both to GitHub users and to the internal GitHub infrastructure team. It is also a key foundation that will enable more upcoming innovations.
File servers no longer have to be deployed as pairs of identical servers, located near each other, and connected one-to-one by crossover cables. We can now use a pool of heterogeneous file servers in whatever spatial configuration is best.
When an entire server fails, replacing it used to be urgent, because its backup server was operating with no spare. A two-server outage could take hundreds of thousands of repositories offline. Now when a server fails, DGit quickly makes new copies of the repositories that it hosted and automatically distributes them throughout the cluster.
Routing around failure gets much less disruptive. Rather than having to reboot and resynchronize a whole server, we just stop routing traffic to it until it recovers. It’s now safe to reboot production servers, with no transition period. Because server outages are less disruptive in DGit, we no longer have to wait for humans to confirm an outage; we can route around it immediately.
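To make the routing idea concrete, here is a small sketch of choosing a replica for a read while skipping servers that are currently marked unhealthy. The post does not say how DGit defines the "best" server, so using the lowest reported load is an assumption, as are the function name `pick_read_replica` and the server names.

```python
def pick_read_replica(replicas, healthy, load):
    """Pick a replica to serve a read.

    replicas: servers holding a copy of the repository
    healthy:  set of servers currently accepting traffic
    load:     dict of server -> current load (hypothetical metric)
    """
    candidates = [r for r in replicas if r in healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available")
    # Assumed policy: route to the least-loaded healthy replica.
    return min(candidates, key=lambda r: load.get(r, 0.0))


replicas = ["fs2", "fs7", "fs9"]
load = {"fs2": 0.2, "fs7": 0.6, "fs9": 0.4}

# Normal operation: the least-loaded replica serves the read.
normal = pick_read_replica(replicas, healthy={"fs2", "fs7", "fs9"}, load=load)

# fs2 is rebooting: traffic simply goes to another replica until it returns.
during_reboot = pick_read_replica(replicas, healthy={"fs7", "fs9"}, load=load)
```

No failover step is involved in this model: taking a server out of the healthy set is enough to stop sending it reads, and putting it back is enough to resume.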
We no longer need to keep hot-spare file servers that sit mostly idle. In DGit, every CPU and all memory is available for handling user traffic. While write operations like pushes must go to every replica of a repository, read operations can be served from any replica. Since read operations far outnumber writes, every repository can now handle nearly three times the traffic it could previously.
The graph below shows the CPU load due to git processes on old file servers (blue) and on DGit servers (green). The blue line is the average of only the active servers; their hot spares are not included. The load on the DGit servers is lower: approximately three times lower at the peaks, and roughly two times lower at the troughs. The troughs don’t show a 3x improvement because all the file servers have background maintenance tasks that cannot be divided among the replicas.
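A back-of-the-envelope model shows why the gain is close to 3x at the peaks but only about 2x at the troughs. The numbers below are made up for illustration, and the model is a simplification: it treats all user traffic as reads that can be split evenly across three replicas, and it treats background maintenance as a fixed cost that every server pays in full.

```python
def per_server_load(user_load, background, replicas):
    """Load on one server: its share of user traffic plus fixed background work."""
    return user_load / replicas + background


def improvement(user_load, background, replicas=3):
    """Ratio of old per-server load (one active server) to DGit per-server load."""
    old = per_server_load(user_load, background, replicas=1)
    new = per_server_load(user_load, background, replicas=replicas)
    return old / new


# Hypothetical load units, chosen only to illustrate the shape of the curve.
peak = improvement(user_load=300, background=1)   # user traffic dominates
trough = improvement(user_load=3, background=1)   # background work is comparable
```

When user traffic dominates, the ratio approaches 3. When user traffic is only three times the background work, the old load is 3 + 1 = 4 units and the new load is 1 + 1 = 2 units, so the ratio drops to 2.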
DGit reduces fate sharing among repositories. Prior to DGit, a fixed set of repositories was stored together on a single server and on that server’s spare. If one repository was too big, or too expensive, or too popular, the other repositories on that file server could become slow. In DGit, those other repositories can be served from their other replicas, which are very unlikely to be on the same servers as the other replicas of the busy repository.
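The "very unlikely" claim can be checked with a quick calculation under a simplifying assumption that is not stated in the post: replicas are placed uniformly at random on distinct servers. The pool size of 100 is hypothetical. Given that some repository shares one server with a busy repository, the sketch asks where its other two replicas land relative to the busy repository's other two servers.

```python
from math import comb


def p_fully_colocated(n_servers):
    """Chance the neighbor's other two replicas sit exactly on the busy
    repository's other two servers (no unaffected replica left)."""
    return 1 / comb(n_servers - 1, 2)


def p_others_clear(n_servers):
    """Chance the neighbor's other two replicas both avoid the busy
    repository's other two servers."""
    return comb(n_servers - 3, 2) / comb(n_servers - 1, 2)


N = 100  # hypothetical pool size
fully_colocated = p_fully_colocated(N)
others_clear = p_others_clear(N)
```

Under these assumptions, with 100 servers there is a 1 in 4,851 chance that the neighboring repository has no replica away from the busy one, and roughly a 96% chance that both of its other replicas are on servers the busy repository does not touch at all.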
The decoupling of replicas means we can put replicas of a repository in different availability zones, or even in different data centers. Availability is improved, and we can (eventually) serve users’ content from servers geographically close to those users.
DGit is a big change, and so we’ve been rolling it out gradually. The most complicated aspect of DGit is that replication is no longer transparent: every repository is now stored explicitly on three servers, rather than on one server with an automatically synchronized hot spare. Thus, DGit must implement its own serializability, locking, failure detection, and resynchronization, rather than relying on DRBD and the RAID controller to keep the copies in sync. Those are rich topics that we’ll explore in later posts; suffice it to say, we wanted to test them all thoroughly before relying on DGit to store customer data. Our deployment progressed over many steps.
During the rollout phase we routinely powered down servers, sometimes several at once, while they were serving live production traffic. User operations were not affected.
As of this writing, 58% of repositories and 96% of Gists, representing 67% of Git operations, are in DGit. We are moving the rest as quickly as we can turn pre-DGit file server pairs into DGit servers.
GitHub always strives to make fetching, pushing, and viewing repositories quick and reliable. DGit is how our repository-storage tier will meet those goals for years to come while allowing us to scale horizontally and increase fault tolerance.
Over the next month we will be following up with in-depth posts on the technology behind DGit.