One of the key tenets of GitHub’s engineering culture (and, I believe, of any good engineering culture) is our obsession with aggressively measuring everything.
Coda Hale’s seminal talk “Metrics, Metrics Everywhere” has been a cornerstone of our monitoring philosophy. Since the very early days, our engineering team has been extremely performance-focused; our commitment is to ship products that perform as well as they possibly can (“it’s not fully shipped until it’s fast”), and the only way to accomplish this is to reliably and methodically measure, monitor and graph every single part of our production systems. We firmly believe that metrics are the most important tool we have to keep GitHub fast.
In our quest to graph everything, we became one of the early adopters of statsd, a Metrics (with a capital M) aggregation daemon written in Node.js. Its job is to take complex Metrics and aggregate them in a server so they can be reported to your storage of choice (in our case, Graphite) as individual datapoints.
statsd is, by and large, a good idea. Its simple design around a UDP socket and its very Unix-y, text-based format for reporting complex metrics make it trivial to measure every nook and cranny of our infrastructure, from the main Rails app, where each server process reports hundreds of individual metrics per request, to the smallest, most trivial command-line tool we run in our datacenter: reporting a simple metric is just one nc -u invocation away.
However, like most technologies, the idea behind statsd only stretches so far.
Three and a half years ago, as we spun up more and more machines to keep up with GitHub’s
incessant growth, weird things started to happen to our graphs. Engineers started
complaining that specific metrics (particularly new ones) were not showing up on their dashboards, and that the values of some older metrics were simply too low to be right.
In the spirit of aggressively monitoring everything, we used our graphs to debug the
issue with our graphs.
statsd uses a single UDP socket to receive metrics from our entire infrastructure: the simplicity of this technical decision allowed us to build a metrics cluster quickly and without hassle, starting with a single metrics aggregator back when we had very limited resources and very limited programmer time to grow our company. But this decision had now come back to bite us in the ass. A
quick look at the
/proc/net/dev-generated graphs on the machine made the situation
very obvious: slowly but steadily, the percentage of UDP packets being dropped on our monitoring server was creeping up, from 3% to upwards of 40%. We were dropping almost half of our metrics!
Before we even considered checking our network for congestion issues, we went to the
most probable source of the problem: the daemon that was actually receiving and
aggregating the metrics. The original
statsd was a Node.js application, written back when Node.js was still new and had limited tooling available.
After some tweaking with V8, we managed to profile statsd and track the bottleneck down to a critical area of the code: the UDP socket that served as the daemon’s only input. The single-threaded, event-based nature of Node was a rather poor fit for that single socket: its high throughput made polling essentially a waste of cycles, and the reader callback API (again, enforced by Node’s event-based design) was causing unnecessary GC pressure by allocating an individual “message” object for each incoming datagram.
Node.js has always been a foreign technology at GitHub (over the years we’ve taken down
the few critical Node.js services we had running, and we’ve rewritten them in languages
we have more expertise working with). Because of this, rather than trying to dig into
statsd itself, our first approach to fixing the performance issues was a more pragmatic one: since the statsd process was bottlenecked on CPU, we decided to load balance the incoming metrics between several processes. It quickly became obvious, however, that load balancing metric packets is not possible with the usual tools (e.g. HAProxy), because the consistent hashing must be performed on the metric name, not the source IP.
The whole point of a metric aggregator is to unify metrics so they get reported only once
to the storage backend. With several aggregators running, balancing metrics between
them based on their source IP will cause the same metric to be aggregated in more than one
daemon, and hence reported more than once to the backend. We would be back to the initial problem statsd was trying to solve.
To work around this, we wrote a custom load balancer that parsed the metric packets and used consistent hashing on the metric’s name to spread them across a cluster of statsd instances. This was an improvement, but not quite the result we were looking for.
With several statsd instances that still couldn’t handle our metrics load, and a load balancer that was already essentially “parsing” the metric packets, we decided to take a more drastic approach and redesign the way we aggregated metrics from a clean slate, by writing a statsd-compatible daemon called Brubeck.
Taking an existing application and rewriting it in another language very rarely gives good results. Especially in the case of a Node.js server, you go from having an event loop written in C (libuv, the cross-platform library that powers Node.js’s event framework, is written in C) to having, well… an event loop written in C.
A straight port of statsd to C would hardly have offered the performance improvement we required. So instead of micro-optimizing such a port to squeeze performance out of it, we focused on redesigning the architecture of the application so it became efficient, and then implemented it as simply as possible: that way, the app runs fast even with few optimizations, and the code is less complex and hence more reliable.
The first thing we changed in Brubeck was the event-loop-based approach of the original statsd. Evented I/O on a single socket is a waste of cycles: when you are receiving 4 million packets per second, polling for read events gives you unsurprisingly predictable results, because there is always a packet ready to be read. We therefore replaced the event loop with several worker threads sharing a single listen socket.
Several threads aggregating the same metrics means that access to the metrics table needs to be synchronized. We used a modified version of a concurrent, read-lock-free hash table with optimistic locking on writes, optimized for workloads with a high read-to-write ratio, and it performs exceedingly well for our use case.
The individual metrics were synchronized for aggregation with a simple spinlock per metric, which also fits our use case rather well: there is almost no contention on any individual metric, and the operations performed under the lock are quick.
The simple, multi-threaded design of the daemon got us very far: during the last two years, our single Brubeck instance has scaled up to 4.3 million metrics per second (with no packet drops even at peak hours, just as we always intended).
On top of that, last year we added a lot more headroom for future growth by enabling SO_REUSEPORT, a flag available in Linux 3.9+ that allows several threads to bind to the same port simultaneously, with the kernel round-robining incoming packets between them, instead of having the threads race for a single socket. This dramatically decreases user-land contention, simplifies our code, and sets us up for almost NIC-bound performance. The only possible next step would be to skip the kernel network stack altogether, but we’re not quite there yet.
After three years running in our production servers (with very good results both in performance and reliability), today we’re finally releasing Brubeck as open-source.
Brubeck is a thank-you gift to the monitoring community for all the insight and invaluable open-source tooling it has developed over the last couple of years, without which bootstrapping and keeping GitHub running and monitored would have been significantly more difficult.
Please skim the technical documentation of the daemon, carefully read and acknowledge its limitations, and figure out whether it’s of any use to you.