Metrics at Uber

2020-02-23

Originally posted as a comment in a discussion of M3DB, a distributed timeseries database

I set up a lot of Uber's early metrics infrastructure, so I can speak to how they got to the point where building a custom solution was the right answer.

In the beginning, we didn't really have metrics, we had logs. Lots of logs. We tried to use Splunk to get some insight from them. It kinda worked, and their sales team initially quoted a high-but-reasonable price for licensing. When we were ready to move forward, the license price doubled because they had missed their end-of-quarter sales quota. So we kicked Splunk to the curb.

Since the bulk of our log volume was noise and we really only cared about a few small numbers, I went looking for a metrics solution at that point, not a logs solution. I'd operated RRDtool-based systems at previous companies, and that worked okay, but I didn't love the idea of doing it again. I had seen Etsy's blog post about statsd, so I set up a statsd+carbon+graphite instance on a single server just to try it out and get feedback from the rest of the engineering team. The team very quickly took to Graphite and started instrumenting various codebases and systems to feed metrics into statsd.
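
For context on what "instrumenting" meant here: applications just fire small UDP datagrams at statsd in its plain-text line format and move on. A minimal sketch in Python (the host, port, and metric names are made up for illustration):

```python
import socket

# statsd accepts plain-text UDP datagrams of the form "<name>:<value>|<type>",
# where the type is c (counter), ms (timer), or g (gauge).
STATSD_ADDR = ("statsd.internal", 8125)  # hypothetical host and port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(name, count=1):
    """Fire-and-forget a counter increment; UDP means no acks and no retries."""
    sock.sendto(f"{name}:{count}|c".encode(), STATSD_ADDR)

def timing(name, ms):
    """Record a timing sample in milliseconds."""
    sock.sendto(f"{name}:{ms}|ms".encode(), STATSD_ADDR)

incr("dispatch.requests")
timing("dispatch.request_time", 42)
```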

statsd hit capacity problems first: it was a single-threaded Node.js process that ingested events over UDP, so once it approached 100% CPU utilization, events got dropped. We switched to statsite, which is pretty much a drop-in replacement written in C.

The next issue was disk I/O. This was not a surprise. Carbon (Graphite's storage daemon) stores each metric in a separate file in the whisper format, which is similar to RRDtool's files but implemented in pure Python and generally a bit easier to work with. We'd expected that a large volume of random writes on a spinning disk would eventually become a problem. We ordered some SSDs. That worked okay for a while.
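
To make the I/O pattern concrete: every metric name maps to its own whisper file, and every flush interval touches every file. A rough sketch using the whisper Python library (the path and retention below are illustrative, not our actual config):

```python
import os
import time

import whisper  # the pure-Python storage library that carbon writes with

# One file per metric name. The retention here is illustrative:
# 10-second resolution kept for one day (8640 points).
path = "/tmp/whisper-demo/dispatch/requests.wsp"
os.makedirs(os.path.dirname(path), exist_ok=True)
whisper.create(path, [(10, 8640)])

# Every flush interval, carbon seeks into each metric's file and writes a few
# bytes; hundreds of thousands of metrics means hundreds of thousands of small
# random writes per interval.
whisper.update(path, 123.0, int(time.time()))

# Reads seek back into the same files.
(start, end, step), values = whisper.fetch(path, int(time.time()) - 3600)
print(len(values), "points at", step, "second resolution")
```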

At this point, the dispatch system was instrumented to store metrics under keys with a lot of dimensions, so that we could generate per-city, per-process, per-handler charts for debugging and performance optimization. While very useful for drilling down to the cause of an issue, this led to a combinatorial explosion in the number of unique metrics we were ingesting. I set up carbon-relay to shard the storage across a few servers (I think there were three, but it was a long time ago). We never really got carbon-relay working well: it didn't handle backend outages or network interruptions gracefully, and it would sometimes start leaking memory and crash, seemingly without reason. It limped along for a while, but it wasn't going to be a long-term solution.
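
That growth is easy to underestimate, because the number of unique series is the product of the dimension sizes, not the sum. A hypothetical illustration (the key scheme and counts are invented, not Uber's actual naming):

```python
# Hypothetical key scheme: every dimension gets baked into the graphite path.
def metric_key(city, process, handler, stat):
    return f"dispatch.{city}.{process}.{handler}.{stat}"

print(metric_key("sf", "api12", "request_ride", "p95"))

# Unique series grow multiplicatively with every dimension you add.
cities, processes, handlers, stats = 100, 40, 50, 6
print(cities * processes * handlers * stats)  # 1,200,000 distinct series
```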

We started looking for alternatives to carbon, since we wanted to get away from whisper files: SSDs were still fairly expensive, and we believed we should be able to store an append-only dataset on spinning disks with batched sequential writes. The infrastructure team was still fairly small, and we didn't have the resources to properly maintain an HBase cluster for OpenTSDB or a Cassandra cluster, which would also have required adapting carbon. (I understand that Cassandra is a supported backend these days, but at that point it was just an idea on a mailing list.)

InfluxDB looked like exactly what we wanted, but it was still in a very early state, as the company had just been formed weeks earlier. I submitted some bug reports but was eventually told by one of the maintainers that it wasn't ready yet and I should quit bugging them so they could get to MVP.

Right around this time, we started having serious availability issues with metrics, both on the storage side (I estimated we were dropping about 60% of incoming statsd events) and on the query side (Graphite would take seconds to minutes to render some charts and would occasionally just time out). We had also built an ad-hoc system for generating Nagios checks that polled Graphite every minute to trigger threshold-based alerts, and those checks would make noise whenever Graphite was down even if the monitored system was fine. This led to on-call fatigue, which made everybody unhappy.
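
The checks themselves were nothing fancy: pull a recent window from Graphite's render API as JSON and compare the latest value to a static threshold. A rough sketch of that style of check, with a hypothetical Graphite host and target (not the actual script we ran):

```python
import sys

import requests  # assumed available; any HTTP client would do

GRAPHITE = "http://graphite.internal"  # hypothetical host

def check_threshold(target, warn, crit, window="-5min"):
    """Nagios-style check: exit 0/1/2 for OK/WARNING/CRITICAL, 3 for UNKNOWN."""
    try:
        resp = requests.get(
            f"{GRAPHITE}/render",
            params={"target": target, "from": window, "format": "json"},
            timeout=10,
        )
        resp.raise_for_status()
        series = resp.json()
        points = [v for v, _ts in series[0]["datapoints"] if v is not None] if series else []
    except Exception as exc:
        # If Graphite itself is down or slow, the check pages as UNKNOWN even
        # when the system being monitored is perfectly healthy.
        print(f"UNKNOWN: could not query graphite: {exc}")
        sys.exit(3)

    if not points:
        print(f"UNKNOWN: no datapoints for {target}")
        sys.exit(3)
    latest = points[-1]
    if latest >= crit:
        print(f"CRITICAL: {target} = {latest}")
        sys.exit(2)
    if latest >= warn:
        print(f"WARNING: {target} = {latest}")
        sys.exit(1)
    print(f"OK: {target} = {latest}")
    sys.exit(0)

check_threshold("dispatch.errors.count", warn=100, crit=500)
```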

We started running an instance of statsite on every server to aggregate that server's individual events into 10-second buckets, with the server's hostname as a key prefix, and push the results to carbon-relay. This solved the dropped-packets issue, but carbon-relay was still unreliable.
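
Conceptually, that per-host statsite layer collapsed thousands of raw UDP events into one pre-aggregated carbon line per key per 10-second window, with the host baked into the key. A simplified Python illustration of the idea (the key naming and prefix scheme are hypothetical):

```python
import socket
import time
from collections import defaultdict

# Simplified picture of the per-host aggregation step: many raw counter events
# collapse into one carbon plaintext line ("name value timestamp") per key per
# 10-second window, with this host's name prefixed onto the key.
HOST = socket.gethostname().split(".")[0]

counters = defaultdict(int)

def record(key, count=1):
    """Called for every incoming event; just accumulates in memory."""
    counters[key] += count

def flush():
    """Emit one line per key for the window; in practice the output stream
    went to carbon-relay over TCP instead of being returned."""
    now = int(time.time())
    lines = [f"hosts.{HOST}.{key} {value} {now}\n" for key, value in counters.items()]
    counters.clear()
    return "".join(lines)

record("dispatch.requests")
record("dispatch.requests")
print(flush(), end="")
```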

We were pretty entrenched in the statsd+graphite way of doing things at this point, so switching to OpenTSDB wasn't really an option, and we'd exhausted the existing carbon alternatives, so we started thinking about modifying carbon to use another datastore. The scope of that project was large enough that it wasn't going to get built in a matter of days or weeks, so we needed a stopgap to buy time and keep the metrics flowing while we engineered a long-term solution.

I hacked together statsrelay, which is basically a re-implementation of carbon-relay in C using libev. At that point I was burned out, and I handed the metrics infrastructure off to a few teammates, who ran with statsrelay and turned it into a production-quality piece of code. Right around the same time, we'd begun hiring an engineering team in NYC that would take over responsibility for metrics infrastructure. Those are the people who eventually designed and built M3DB.
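
The core job of a relay like that is small: hash each metric key, pick a backend deterministically, and forward the line so a given key always lands on the same storage node. A minimal Python sketch of the sharding idea (this is not the actual statsrelay code, which is C on top of libev, and the backend names are made up):

```python
import hashlib

# Hypothetical backend list; in production these were the storage hosts.
BACKENDS = ["metrics-store01", "metrics-store02", "metrics-store03"]

def backend_for(key):
    """Deterministically map a metric key to a backend.

    Naive modulo hashing is shown here; a real relay wants consistent hashing
    so that adding or removing a backend doesn't reshuffle nearly every key.
    """
    digest = hashlib.md5(key.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:4], "big") % len(BACKENDS)]

print(backend_for("dispatch.sf.web01.request_time"))
```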
