GitHub Engineering

GitHub uses MySQL to store its metadata: Issues, Pull Requests, comments, organizations, notifications and so forth. While git repository data does not need MySQL to exist and persist, GitHub’s service does. Authentication, API, and the website itself all require the availability of our MySQL fleet.

Our replication topologies span multiple data centers and this poses a challenge not only for availability but also for manageability and operations.

Automated failovers

We use a classic MySQL master-replicas setup, where the master is the single writer, and replicas are mainly used for read traffic. We expect our MySQL fleet to be available for writes. Placing a review, creating a new repository, adding a collaborator, all require write access to our backend database. We require the master to be available.

To that effect we employ automated master failovers. The time it would take a human to wake up and fix a failed master exceeds what our availability expectations allow, and operating such a failover is sometimes non-trivial. We expect master failures to be automatically detected and recovered within 30 seconds, and we expect failover to result in minimal loss of available hosts.

We also expect to avoid false positives and false negatives. Failing over when there’s no failure is wasteful and should be avoided. Not failing over when failover should take place means an outage. Flapping is unacceptable. And so there must be a reliable detection mechanism that makes the right choice and takes a predictable course of action.

orchestrator

We employ orchestrator to manage our MySQL failovers. orchestrator is an open source MySQL replication management and high availability solution. It observes MySQL replication topologies, auto-detects topology layout and changes, understands replication rules across configurations and versions, detects failure scenarios, and recovers from master and intermediate master failures.

3 datacenter topology

Failure detection

orchestrator takes a different approach to failure detection than the common monitoring tools. The common way to detect master failure is by observing the master: via ping, via simple port scan, via simple SELECT query. These tests all suffer from the same problem: What if there’s an error?

Network glitches can happen; the monitoring tool itself may be network partitioned. The naive solutions are along the lines of “try several times at fixed intervals, and on the n-th successive failure, assume the master has failed”. While repeated polling works, such schemes tend to lead to false positives and to extended outages: the smaller n is (or the shorter the interval), the greater the potential for a false positive, since short network glitches will cause unjustified failovers. Larger n values (or longer poll intervals), however, delay detection of a true failure.
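The trade-off in the naive scheme can be illustrated with a minimal sketch. The function name `naive_detect` and the probe sequences are made up for illustration; this is the approach being criticized, not how orchestrator works.

```python
def naive_detect(probe_results, n):
    """Return the index of the probe at which failure is declared, or None.

    probe_results: iterable of booleans, True meaning the probe succeeded.
    n: number of successive failed probes required to declare failure.
    """
    consecutive_failures = 0
    for i, ok in enumerate(probe_results):
        # Any successful probe resets the streak of failures.
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= n:
            return i
    return None


# A two-probe network glitch followed by recovery (hypothetical data).
glitch = [True, False, False, True, True, True]
# A genuine failure: the master never answers again (hypothetical data).
outage = [True, False, False, False, False, False]

small_n_glitch = naive_detect(glitch, n=2)  # fails over on a mere glitch
large_n_glitch = naive_detect(glitch, n=4)  # rides out the glitch
small_n_outage = naive_detect(outage, n=2)  # detects the outage early
large_n_outage = naive_detect(outage, n=4)  # detects it two probes later
```

With a small n the glitch is mistaken for a failure; with a large n the glitch is ignored, but the real outage goes undetected for two more poll intervals.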

A better approach employs multiple observers, all of whom, or the majority of whom must agree that the master has failed. This reduces the danger of a single observer suffering from network partitioning.

orchestrator uses a holistic approach, utilizing the replication cluster itself. The master is not an isolated entity. It has replicas. These replicas continuously poll the master for incoming changes, copy those changes and replay them. They have their own retry count/interval setup. When orchestrator looks for a failure scenario, it looks at the master and at all of its replicas. It knows what replicas to expect because it continuously observes the topology, and has a clear picture of what it looked like the moment before failure.

orchestrator seeks agreement between itself and the replicas: if orchestrator cannot reach the master, but all replicas are happily replicating and making progress, there is no failure scenario. But if the master is unreachable to orchestrator and all replicas say: “Hey! Replication is broken, we cannot reach the master”, our conclusion becomes very powerful: we haven’t just gathered input from multiple hosts. We have identified that the replication cluster is de facto broken. The master may be alive, it may be dead, it may be network partitioned; it does not matter: the cluster does not receive updates and for all practical purposes does not function. This situation is depicted in the image below:

3 datacenter topology
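The agreement check described above can be sketched as follows. This is a simplified illustration and not orchestrator’s actual code: the `ReplicaView` type, its fields, and the function `master_is_dead` are hypothetical names, and the sketch only covers the clear-cut case where every replica is reachable and every one of them reports broken replication.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaView:
    """What the observer currently knows about one replica (hypothetical model)."""
    host: str
    reachable: bool       # can the observer reach this replica?
    replication_ok: bool  # does the replica report a working replication stream?


def master_is_dead(master_reachable: bool, replicas: List[ReplicaView]) -> bool:
    """Declare failure only when the observer and all replicas agree."""
    if master_reachable:
        return False
    if not replicas:
        # Nobody to corroborate the observer's view; do not act on it alone.
        return False
    return all(r.reachable and not r.replication_ok for r in replicas)


# The observer is partitioned from the master, but replication is healthy.
observer_partitioned = master_is_dead(False, [
    ReplicaView("replica-1", True, True),
    ReplicaView("replica-2", True, True),
])

# Nobody can reach the master: the cluster is broken for practical purposes.
cluster_broken = master_is_dead(False, [
    ReplicaView("replica-1", True, False),
    ReplicaView("replica-2", True, False),
])
```

In the first case the sketch reports no failure, because the replicas contradict the observer. In the second case it reports a failure, regardless of whether the master process itself is alive.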

Masters are not the only subject of failure detection: orchestrator applies similar logic to intermediate masters, which are replicas that have further replicas of their own.

Furthermore, orchestrator also considers more complex cases, such as having unreachable replicas, and other scenarios where decision making becomes fuzzier. In some such cases, it is still confident enough to proceed to failover. In others, it settles for a detection notification only.

We observe that orchestrator’s detection algorithm is very accurate. We spent a few months testing its decision making before switching on auto-recovery.

Failover

Once the decision to fail over has been made, the next step is to choose where to fail over to. That decision, too, is non-trivial.

In semi-sync replication environments, which orchestrator supports, one or more designated replicas are guaranteed to be the most up-to-date, which makes them ideal candidates for promotion. Enabling semi-sync is on our roadmap; at this time we use asynchronous replication. Some updates made to the master may never make it to any replica, and there is no guarantee as to which replica will get the most recent updates. Choosing the most up-to-date replica means losing the least data. However, in the world of operations not all replicas are created equal: at any given time we may be experimenting with a recent MySQL release that we’re not yet ready to put into production; or may be transitioning from STATEMENT based replication to ROW based; or may have servers in a remote data center that preferably wouldn’t take writes. Or there may be a designated server with stronger hardware that we’d like to promote no matter what.

orchestrator understands all replication rules and picks a replica that makes most sense to promote based on a set of rules and the set of available servers, their configuration, their physical location and more. Depending on servers’ configuration, it is able to do a two-step promotion by first healing the topology in whatever setup is easiest, then promoting a designated or otherwise best server as master.
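To make the idea of rule-based promotion concrete, here is a minimal sketch. The rules, the `Candidate` fields and the function `choose_promotion` are assumptions chosen for illustration; orchestrator’s real rule set is richer and is not reproduced here. The sketch filters out servers that should not be promoted, prefers a given data center, and only then prefers the most up-to-date server.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A surviving replica, in a hypothetical simplified model."""
    host: str
    data_center: str
    executed_position: int   # hypothetical scalar: higher means more up to date
    has_binary_logs: bool    # a server without binary logs cannot feed replicas
    promotable: bool = True  # hypothetical operator flag, e.g. for test servers


def choose_promotion(candidates: List[Candidate],
                     preferred_dc: str) -> Optional[Candidate]:
    """Pick the replica to promote, or None if no server qualifies."""
    eligible = [c for c in candidates if c.has_binary_logs and c.promotable]
    if not eligible:
        return None
    # Location first, then how much data the server has applied.
    return max(eligible,
               key=lambda c: (c.data_center == preferred_dc, c.executed_position))


candidates = [
    Candidate("replica-a", "dc1", 100, True),
    Candidate("replica-b", "dc2", 120, True),
    Candidate("replica-c", "dc1", 130, False),
    Candidate("replica-d", "dc1", 110, True, promotable=False),
]
chosen = choose_promotion(candidates, preferred_dc="dc1")
```

Here `replica-a` is chosen even though three other servers are more up to date: one lacks binary logs, one is flagged as not promotable, and one sits in a remote data center. This is the tension the text describes between losing the least data and promoting a server that is fit to be a master.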

We build trust in the failover procedure by continuously testing failovers. We intend to write more on this in a later post.

Anti-flapping and acknowledgements

Flapping is strictly unacceptable. To that effect orchestrator is configured to perform only one automated failover for any given cluster in a preconfigured time period. Once a failover takes place, the failed cluster is marked as “blocked” from further failovers. This mark is cleared after, say, 30 minutes, or earlier if a human says otherwise.

To clarify, an automated master failover in the middle of the night does not mean stakeholders get to sleep through it. Pages will arrive, even as failover takes place. A human will observe the state, and may or may not acknowledge the failover as justified. Once acknowledged, orchestrator forgets about that failover and is free to proceed with further failovers on that cluster should the case arise.
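The blocking and acknowledgement behavior can be sketched as a small gate object. The class `FailoverGate` and its method names are hypothetical and only model the behavior described above; timestamps are passed in explicitly, in seconds, to keep the example deterministic.

```python
class FailoverGate:
    """Allows at most one automated failover per cluster per blocking period."""

    def __init__(self, block_seconds):
        self.block_seconds = block_seconds
        self._last_failover = {}  # cluster name -> time of last failover

    def may_failover(self, cluster, now):
        """True if the cluster is not currently blocked."""
        last = self._last_failover.get(cluster)
        return last is None or now - last >= self.block_seconds

    def record_failover(self, cluster, now):
        """Mark the cluster as blocked, starting at the given time."""
        self._last_failover[cluster] = now

    def acknowledge(self, cluster):
        """A human acknowledged the failover: forget it and lift the block."""
        self._last_failover.pop(cluster, None)


gate = FailoverGate(block_seconds=30 * 60)
allowed_initially = gate.may_failover("sample-cluster", now=0)
gate.record_failover("sample-cluster", now=0)
allowed_a_minute_later = gate.may_failover("sample-cluster", now=60)
gate.acknowledge("sample-cluster")
allowed_after_ack = gate.may_failover("sample-cluster", now=61)
```

A second failure one minute after the first does not trigger another automated failover, whereas an acknowledgement lifts the block immediately.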

Topology management

There’s more than failovers to orchestrator. It allows for simplified topology management and visualization.

We have multiple clusters of differing size, that span multiple datacenters (DCs). Consider the following:

3 datacenter topology

The different colors indicate different data centers, and the above topology spans three DCs. The cross-DC network has higher latency, and network calls across it are more expensive than within a DC, so we typically group a DC’s servers under a designated intermediate master, aka local DC master, and reduce cross-DC network traffic. In the above, instance-64bb (blue, 2nd from bottom on the right) could replicate from instance-6b44 (blue, bottom, middle) and free up some cross-DC traffic.

This design leads to more complex topologies: replication trees that go deeper than one or two levels. There are more use cases for having such topologies:

  • Experimenting with a newer version: to test, say, MySQL 5.7 we create a subtree of 5.7 servers, with one acting as an intermediate master. This allows us to test 5.7 replication flow and speed.
  • Migrating from STATEMENT based replication to ROW based replication: we again migrate slowly by creating subtrees, adding more and more nodes to those trees until they consume the entire topology.
  • By way of simplifying automation: a newly provisioned host, or a host restored from backup, is set to replicate from the backup server whose data was used to restore the host.
  • Data partitioning is achieved by incubating and splitting out new clusters, originally dangling as sub-clusters then becoming independent.

Deep nested replication topologies introduce management complexity:

  • All intermediate masters become points of failure for their nested subtrees.
  • Recoveries in mixed-version or mixed-format topologies are subject to cross-version or cross-format replication constraints. Not every server can replicate from every other.
  • Maintenance requires careful refactoring of the topology: you can’t just take down a server to upgrade its hardware; if it serves as a local/intermediate master, taking it offline would break replication on its own replicas.

orchestrator allows for easy and safe refactoring and management of such complex topologies:

  • It can fail over dead intermediate masters, eliminating the “point of failure” problem.
  • Refactoring (moving replicas around the topology) is made easy via GTID or Pseudo-GTID (an application level injection of sparse GTID-like entries).
  • orchestrator understands replication rules and will refuse to place, say, a 5.6 server below a 5.7 server.
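The version rule in the last bullet can be sketched as a simple comparison. The helper names `major_minor` and `can_replicate_from` are hypothetical, and the rule is deliberately reduced to the one stated above (a replica must not run an older major.minor release than its master); real replication compatibility involves more than the version number.

```python
def major_minor(version):
    """Extract (major, minor) from a version string such as '5.6.31-77.0-log'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def can_replicate_from(replica_version, master_version):
    """True if a replica of this version may be placed below this master."""
    return major_minor(replica_version) >= major_minor(master_version)


# A 5.7 server may replicate from a 5.6 master...
newer_below_older = can_replicate_from("5.7.17", "5.6.31-77.0-log")
# ...but a 5.6 server is refused below a 5.7 master.
older_below_newer = can_replicate_from("5.6.31-77.0-log", "5.7.17")
```

A refactoring tool would run a check of this kind before moving a replica, and reject the move rather than break replication.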

orchestrator also serves as the de facto topology state/inventory indicator. It complements Puppet or service discovery configuration, which imply the desired state, by actually observing the existing state. State is queryable at various levels, and we employ orchestrator in some of our automation tasks.

Chatops integration

We love our chatops as they make our operations visible and accessible to our greater group of engineers. While the orchestrator service provides a web interface, we rarely use it; one’s browser is her own private command center, with no visibility to others and no history.

We rely on chatops for most operations. As a quick example of the visibility we get through chatops, let’s examine a cluster:

shlomi-noach
.orc cluster sample-cluster
Hubot
host                     lag  status  version          mode  format     extra
---                      ---  ---     ---              ---   ---        ---
instance-e854             0s  ok      5.6.26-74.0-log  rw    STATEMENT  >>,P-GTID
+ instance-fadf           0s  ok      5.6.26-74.0-log  ro    STATEMENT  >>,P-GTID
  + instance-9d3d         0s  ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
  + instance-8125         0s  ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
+ instance-b982           0s  ok      5.6.26-74.0-log  ro    STATEMENT  >>,P-GTID
+ instance-c5a7           0s  ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
  + instance-64bb         0s  ok      5.6.31-77.0-log  rw    nobinlog   P-GTID
+ instance-6b44           0s  ok      5.6.31-77.0-log  rw    STATEMENT  >>,P-GTID
  + instance-cac3     14400s  ok      5.6.31-77.0-log  rw    STATEMENT  >>,P-GTID

Say we wanted to upgrade instance-fadf to 5.6.31-77.0-log. It has two replicas attached, which we don’t want to be affected. We can:

shlomi-noach
.orc relocate-replicas instance-fadf below instance-c5a7
Hubot
instance-9d3d
instance-8125

To the effect of:

shlomi-noach
.orc cluster sample-cluster
Hubot
host                     lag  status  version          mode  format     extra
---                      ---  ---     ---              ---   ---        ---
instance-e854             0s  ok      5.6.26-74.0-log  rw    STATEMENT  >>,P-GTID
+ instance-fadf           0s  ok      5.6.26-74.0-log  ro    STATEMENT  >>,P-GTID
+ instance-b982           0s  ok      5.6.26-74.0-log  ro    STATEMENT  >>,P-GTID
+ instance-c5a7           0s  ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
  + instance-9d3d         0s  ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
  + instance-8125         0s  ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
  + instance-64bb         0s  ok      5.6.31-77.0-log  rw    nobinlog   P-GTID
+ instance-6b44           0s  ok      5.6.31-77.0-log  rw    STATEMENT  >>,P-GTID
  + instance-cac3     14400s  ok      5.6.31-77.0-log  rw    STATEMENT  >>,P-GTID

The instance is now free to be taken out of the pool.
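The effect of the relocation above can be modeled on a plain mapping from each replica to its master. This is only a sketch of the bookkeeping, with a hypothetical function `relocate_replicas`; the real operation also has to repoint replication on each server, which is not shown. The host names are taken from the sample cluster above.

```python
def relocate_replicas(master_of, source, destination):
    """Move all direct replicas of `source` below `destination`.

    master_of maps each replica host to its master host.
    Returns the list of hosts that were moved.
    """
    moved = [host for host, master in master_of.items()
             if master == source and host != destination]
    for host in moved:
        master_of[host] = destination
    return moved


# The sample cluster before the change: replica -> master.
master_of = {
    "instance-fadf": "instance-e854",
    "instance-9d3d": "instance-fadf",
    "instance-8125": "instance-fadf",
    "instance-b982": "instance-e854",
    "instance-c5a7": "instance-e854",
    "instance-64bb": "instance-c5a7",
    "instance-6b44": "instance-e854",
    "instance-cac3": "instance-6b44",
}

moved = relocate_replicas(master_of, "instance-fadf", "instance-c5a7")
remaining = [h for h, m in master_of.items() if m == "instance-fadf"]
```

After the call, `moved` lists the two relocated hosts and `remaining` is empty: instance-fadf has no replicas left, which is what makes it safe to take out of the pool.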

Other actions are available to us via chatops. We can force a failover, acknowledge recoveries, query topology structure etc. orchestrator further communicates with us on chat, and notifies us in the event of a failure/recovery.

orchestrator also runs as a command-line tool, and the orchestrator service exposes a web API, so it can easily participate in automated tasks.

orchestrator @ GitHub

GitHub has adopted orchestrator, and will continue to improve and maintain it. The GitHub repo will serve as the new upstream and will accept issues and pull requests from the community.

orchestrator continues to be free and open source, and is released under the Apache License 2.0.

Migrating the project to the GitHub repo had the unfortunate result of diverging from the original Outbrain repo, due to the way import paths are coupled with the repo URI in golang. The two diverged repositories will not be kept in sync, and we took the opportunity to make some further diverging changes, though we made sure to keep the API and command line spec compatible. We’ll keep an eye out for incoming issues on the Outbrain repo.

Outbrain

It is our pleasure to acknowledge Outbrain as the original author of orchestrator. The project originated at Outbrain while seeking to manage a growing fleet of servers in three data centers. It began as a means to visualize the existing topologies, with minimal support for refactoring, and came at a time when massive hardware upgrades and datacenter changes were taking place. orchestrator was used as the tool for refactoring and for ensuring topology setups went as planned and without interruption to service, even as servers were being provisioned or retired.

Later on Pseudo-GTID was introduced to overcome the problems of unreachable/crashing/lagging intermediate masters, and shortly afterwards recoveries came into being. orchestrator was put to production in very early stages and worked on busy and sensitive systems.

Outbrain was happy to develop orchestrator as a public open source project and provided the resources to allow its development, not only for the specific benefit of the company, but also for the wider community. Outbrain authors many more open source projects, which can be found on the Outbrain engineering page on GitHub.

We’d like to thank Outbrain for their contributions to orchestrator, as well as for their openness to having us adopt the project.

Further acknowledgements

orchestrator was later developed at Booking.com. It was brought in to improve on the existing high availability scheme. orchestrator’s flexibility allowed for simpler hardware setup and faster failovers. It was fortunate to enjoy the large MySQL setup Booking.com employs, managing various MySQL vendors, versions, configurations, running on clusters ranging from a single master to many hundreds of MySQL servers and Binlog Servers on multiple data centers. Booking.com continuously contributes to orchestrator.

We’d like to further acknowledge major community contributions made by Google/Vitess (orchestrator is the failover mechanism used by Vitess), and by Square, Inc.

We’ve released a public Puppet module for orchestrator, authored by @tomkrouper. This module sets up the orchestrator service, config files, logging etc. We use this module within our own puppet setup, and actively maintain it.

Chef users, please consider this Chef cookbook by @silviabotros.

shlomi-noach

Senior Infrastructure Engineer
