At GitHub we serve billions of HTTP, Git, and SSH connections each day. To get the best performance we run on bare metal hardware. Historically, one of the more complex components has been our load balancing tier. Traditionally we scaled it vertically, with a small set of very large machines running haproxy and a very specific hardware configuration that allowed dedicated 10G link failover. Eventually we needed a solution that could scale horizontally, so we set out to create a load balancer that would run on commodity hardware in our typical data center configuration.
Over the last year we’ve developed our new load balancer, called GLB (GitHub Load Balancer). Today, and over the next few weeks, we will be sharing the design and releasing its components as open source software.
GitHub is growing, and our monolithic, vertically scaled load balancer tier had met its match: a new approach was required. Our original design was based around a small number of large machines, each with dedicated links to our network spine. This design tied networking gear, load balancing hosts, and load balancer configuration together so tightly that scaling horizontally was deemed too difficult. We set out to find a better way.
We first identified the goals of the new system, the design pitfalls of the existing system, and prior art that we could draw experience and inspiration from. After some time we determined that the following attributes would produce a successful load balancing tier that we could maintain into the future:

- Runs on commodity hardware
- Scales horizontally
- Supports high availability, avoiding breaking TCP connections wherever possible
- Supports connection draining
- Per service load balancing, with support for multiple services per load balancer host
- Can be iterated on and deployed like normal software
- Testable at each layer, not just with integration tests
- Built for multiple POPs and data centers
- Resilient to typical DDoS attacks, with tooling to help mitigate new attacks
To achieve these goals we needed to rethink the relationship between IP addresses and hosts, the constituent layers of our load balancing tier and how connections are routed, controlled and terminated.
In a typical setup, you assign a single public-facing IP address to a single physical machine. DNS can then be used to split traffic over multiple IPs, letting you shard traffic across multiple servers. Unfortunately, DNS entries are cached fairly aggressively (often ignoring the TTL), and some of our users specifically whitelist or hardcode our IP addresses. Additionally, we offer a fixed set of IPs for our Pages service which customers can use directly for their apex domains. Rather than adding IPs to increase capacity, and accepting that an IP address fails along with its single server, we wanted a solution that would allow a single IP address to be served by multiple physical machines.
Routers have a feature called Equal-Cost Multi-Path (ECMP) routing, which is designed to split traffic destined for a single IP across multiple links of equal cost. ECMP works by hashing certain components of an incoming packet such as the source and destination IP addresses and ports. By using a consistent hash for this, subsequent packets that are part of the same TCP flow will hash to the same path, avoiding out of order packets and maintaining session affinity.
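As a rough illustration of the per-flow hashing that ECMP relies on, here is a minimal Python sketch. Real routers compute this in hardware and the exact hash varies by vendor; the function and link names below are purely hypothetical.

```python
import hashlib

def flow_hash(src_ip, dst_ip, src_port, dst_port, paths):
    """Pick one of several equal-cost paths for a TCP flow.

    The hash depends only on the flow's 4-tuple, so every packet
    belonging to the same TCP connection chooses the same path.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(paths)
    return paths[index]

paths = ["link-a", "link-b", "link-c"]

# Packets of one flow always hash to the same link:
first = flow_hash("203.0.113.7", "192.0.2.1", 50000, 443, paths)
assert all(flow_hash("203.0.113.7", "192.0.2.1", 50000, 443, paths) == first
           for _ in range(100))
```

The determinism of the hash is what provides session affinity: as long as the path set stays the same, a flow never switches links mid-connection.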
This works great for routing packets across multiple paths to the same physical destination server. Where it gets interesting is when you use ECMP to split traffic destined for a single IP across multiple physical servers that each terminate TCP connections but share no state, as in a load balancer. When one of these servers fails or is taken out of rotation and is removed from the ECMP server set, a rehash event occurs and 1/N of all connections are reassigned to the remaining servers. Since these servers don't share connection state, the reassigned connections are reset. Unfortunately, they may not be the same 1/N connections that were mapped to the failing server. Additionally, there is no way to gracefully remove a server for maintenance without also disrupting 1/N of active connections.
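To make that failure mode concrete, the following toy simulation uses a naive hash-mod-N as a stand-in for a router's rehash behavior (not any specific vendor's algorithm). It shows that when one of four load balancers is removed, flows that were never mapped to the failed server can still be re-mapped, and therefore reset.

```python
import hashlib

def pick_server(src_ip, servers):
    # Toy stand-in for a router's ECMP hash: plain hash-mod-N.
    h = int.from_bytes(hashlib.sha256(src_ip.encode()).digest()[:8], "big")
    return servers[h % len(servers)]

clients = [f"198.51.100.{i}" for i in range(1, 255)]
before = {c: pick_server(c, ["lb1", "lb2", "lb3", "lb4"]) for c in clients}
# lb4 fails and is removed from the ECMP set:
after = {c: pick_server(c, ["lb1", "lb2", "lb3"]) for c in clients}

moved = [c for c in clients if before[c] != after[c]]
# Flows that were never on the failed server, yet still got re-mapped:
collateral = [c for c in moved if before[c] != "lb4"]
print(f"{len(moved)}/{len(clients)} flows re-mapped, "
      f"{len(collateral)} of them were not on the failed server")
```

Any flow in `collateral` was healthy and could have continued uninterrupted, but gets reset anyway because its new server has no state for it.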
A pattern that has been used by other projects is to split the load balancers into a L4 and L7 tier. At the L4 tier, the routers use ECMP to shard traffic using consistent hashing to a set of L4 load balancers - typically using software like ipvs/LVS. LVS keeps connection state, and optionally syncs connection state with multicast to other L4 nodes, and forwards traffic to the L7 tier which runs software such as haproxy. We call the L4 tier “director” hosts since they direct traffic flow, and the L7 tier “proxy” hosts, since they proxy connections to backend servers.
This L4/L7 split has an interesting benefit: the proxy tier nodes can now be removed from rotation by gracefully draining existing connections, since the connection state on the director nodes will keep existing connections mapped to their existing proxy server, even after they are removed from rotation for new connections. Additionally, the proxy tier tends to be the one that requires more upkeep due to frequent configuration changes, upgrades and scaling so this works to our advantage.
If multicast connection syncing is used, the L4 load balancer nodes handle failure slightly more gracefully: once a connection has been synced to the other L4 nodes, it will no longer be disrupted by a node failure. Without connection syncing, connections may still survive a director node failure, provided the director nodes hash connections the same way and share the same backend set. In practice, most installations of this tiered design simply accept connection disruption during node failure or maintenance.
Unfortunately, using LVS for the director tier has some significant drawbacks. First, multicast was not something we wanted to support, so we would be relying on the nodes having the same view of the world and hashing consistently to the backend nodes. Without connection syncing, certain events, including planned maintenance of nodes, could cause connection disruption. Connection disruption is something we wanted to avoid, because git cannot retry or resume if a connection is severed mid-flight. Finally, the fact that the director tier requires connection state at all adds extra complexity to DDoS mitigations such as synsanity: to avoid resource exhaustion, syncookies would need to be generated on the director nodes, despite the fact that the connections themselves are terminated on the proxy nodes.
We decided early in the design of our load balancer that we wanted to improve on the common pattern for the director tier. We set out to design a new director tier that was stateless and allowed both director and proxy nodes to be gracefully removed from rotation without disruption to users wherever possible. Some of our users live in countries with less than ideal internet connectivity, and it was important to us that long-running clones of reasonably sized repositories would not fail during planned maintenance, provided they completed within a reasonable time limit.
The design we settled on, and now use in production, is a variant of Rendezvous hashing that supports constant-time lookups. We start by storing each proxy host and assigning it a state. These states handle the connection draining aspect of our design goals and will be discussed further in a future post. We then generate a single, fixed-size forwarding table and fill each row with a set of proxy servers using the ordering component of Rendezvous hashing. This table, along with the proxy states, is sent to all director servers and kept in sync as proxies come and go. When a TCP packet arrives on a director, we hash the source IP to generate a consistent index into the forwarding table. We then encapsulate the packet inside another IP packet (actually Foo-over-UDP) destined to the internal IP of the proxy server, and send it over the network. The proxy server receives the encapsulated packet, decapsulates it, and processes the original packet locally. Any outgoing packets use Direct Server Return, meaning packets destined to the client egress directly to the client, completely bypassing the director tier.
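A minimal sketch of the forwarding-table idea, with illustrative assumptions (256 rows, two proxies per row, hypothetical host names); the real GLB table layout and proxy states are covered in the later posts.

```python
import hashlib

def rank(key: str, proxy: str) -> int:
    # Rendezvous (highest-random-weight) score for a (row, proxy) pair.
    return int.from_bytes(
        hashlib.sha256(f"{key}|{proxy}".encode()).digest()[:8], "big")

def build_table(proxies, rows=256, per_row=2):
    """Fixed-size forwarding table: each row lists proxies in the
    order given by their rendezvous score for that row index."""
    return [
        sorted(proxies, key=lambda p: rank(str(row), p), reverse=True)[:per_row]
        for row in range(rows)
    ]

def lookup(table, src_ip):
    # Hash only the source IP to a row: a constant-time table lookup.
    h = int.from_bytes(hashlib.sha256(src_ip.encode()).digest()[:8], "big")
    return table[h % len(table)]

table = build_table(["proxy1", "proxy2", "proxy3", "proxy4"])
primary, backup = lookup(table, "203.0.113.7")

# Key property: removing a proxy only rewrites rows that listed it,
# so flows hashed to the other rows keep their proxy assignment.
smaller = build_table(["proxy1", "proxy2", "proxy3"])
assert all(smaller[i] == table[i]
           for i in range(len(table)) if "proxy4" not in table[i])
```

Because the table is fixed-size and precomputed, the per-packet work on a director is one hash and one array index, regardless of how many proxies exist.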
Now that you have a taste of the system that processed and routed your request for this blog post, we hope you'll stay tuned for future posts covering our director design in depth, our improvements to haproxy hot configuration reloads, and how we migrated to the new system without anyone noticing.