Schedule rides in advance
M3: Uber’s Open Source, Large-scale Metrics Platform for Prometheus

August 7, 2018 / Global
Share
Facebook
Twitter
Linkedin
Envelope
To facilitate the growth of Uber’s global operations, we need to be able to quickly store and access billions of metrics on our back-end systems at any given time. As part of our robust and scalable metrics infrastructure, we built M3, a metrics platform that has been in use at Uber for several years now.
M3 reliably houses large-scale metrics over long retention time windows. To offer others in the broader community these benefits, we decided to open source the M3 platform as a remote storage backend for Prometheus, a popular monitoring and alerting solution. As its documentation states, Prometheus’ scalability and durability is limited by single nodes. The M3 platform aims to provide a turnkey, scalable, and configurable multi-tenant store for Prometheus metrics. 
As part of this effort, we recently released M3DB, the scalable storage backend for M3. M3DB is a distributed time series store and reverse index with configurable out-of-order writes. 
Additionally, we are open sourcing M3 Coordinator, a Prometheus sidecar which provides a global query and storage interface on top of M3DB clusters. The M3 Coordinator also performs downsampling, as well as ad hoc retention and aggregation of metrics using retention and rollup rules. This allows us to apply specific retention and aggregations to subsets of metrics on the fly with the rules stored in etcd, which runs embedded in the binary of an M3DB seed node.
Figure 1: The M3 Coordinator sidecar is deployed alongside Prometheus near local regional M3DB clusters that are replicated across availability zones, running etcd in the same binary. 
Metrics before M3
By late 2014, all services, infrastructure, and servers at Uber emitted metrics to a Graphite stack that stored them using the Whisper file format in a sharded Carbon cluster. We used Grafana for dashboarding and Nagios for alerting, issuing Graphite threshold checks via source-controlled scripts. While this worked for a while, expanding the Carbon cluster required a manual resharding process and, due to lack of replication, any single node’s disk failure caused permanent loss of its associated metrics. In short, this solution was not able to meet our needs as the company continued to grow.  
To ensure the scalability of Uber’s metrics backend, we decided to build out a system that provided fault tolerant metrics ingestion, storage, and querying as a managed platform. Our goals were fivefold: 
Improved reliability and scalability: to ensure we can continue to scale the business without worrying about loss of availability or accuracy of alerting and observability.
Capability for queries to return cross-data center results: to seamlessly enable global visibility of services and infrastructure across regions.
Low latency service level agreement: to make sure that dashboards and alerting provide a reliable query latency that is interactive and responsive.
First-class dimensional “tagged” metrics: to offer the flexible, tagged data model that Prometheus’ labels and other systems made popular.
Backwards compatibility: to guarantee that hundreds of legacy services emitting StatsD and Graphite metrics continue to function without interruption.
 
Introducing M3
After evaluating open source alternatives to our existing solution, we found that none would meet our goals for resource efficiency or scale and be able to run as a self-service platform. At first M3 leveraged almost entirely open source components for essential roles such as statsite for aggregation, Cassandra with Date Tiered Compaction Strategy for time series storage, and ElasticSearch for indexing. Due to operational burden, cost efficiency, and a growing feature set, we gradually outgrew each one.
Released in 2015, M3 now houses over 6.6 billion time series. M3 aggregates 500 million metrics per second and persists 20 million resulting metrics-per-second to storage globally (with M3DB), using a quorum write to persist each metric to three replicas in a region. It also lets engineers author metric policies that tell M3 to store certain metrics at shorter or longer retentions (two days, one month, six months, one year, three years, five years, etc.) and at a specific granularity (one second, ten seconds, one minute, ten minutes, etc). This allows engineers and data scientists to intelligently store time series at different retentions with both fine and coarse-grained scopes using metrics tag (label) matching to defined storage policies.  For example, engineers can choose to store all metrics where the “application” tag is “mobile_api” and “endpoint” tag is “signup” for both 30 days at ten seconds granularity and five years at one hour granularity.
Integration with Prometheus continues to be an increasingly important priority for Uber’s M3 users, both in terms of providing observability for any application that exports Prometheus metrics and for systems monitoring using node_exporter or other third party Prometheus metrics exporters. Historically, several teams at Uber ran significant Prometheus deployments, and when operational burden on the teams became untenable, we built integrations into M3 for Prometheus, allowing these teams to query their metrics in a global view across all data centers in M3’s durable long-term metric store.
Figure 2: The M3 ingestion and storage pipeline begins by aggregating metrics at defined policies and then storing and indexing them on M3DB cluster storage nodes.Based on our previous experiences running increasingly high metric storage workloads, we built M3 to:  
Optimize every part of the metrics pipeline, giving engineers as much storage as possible for the least amount of hardware spend.
Ensure data is as highly compressed as possible to reduce our hardware footprint, with a significant investment in optimizing Gorilla’s TSZ compression to further compress float64 values, which we call M3TSZ compression.
Keep a lean memory footprint for storage to avoid memory becoming a bottleneck, since a significant portion of each data point can be “write once, read never.” And, to keep access time fast, we maintain a Bloom filter and index summary per shard time window block in mmap’d memory, since we allow ad-hoc queries of up to 100,000 unique time series in a single query over long retention periods (in some cases, spanning years of retention).
Avoid compactions where possible, including the downsampling path, to increase the utilization of host resources for more concurrent writes and provide steady write/read latency.
Use a native design for time series storage that does not require vigilant operational attention to run with a high write volume. (We found that Cassandra required a lot of attention operationally amongst other components we were previously using).
Next, we discuss a few of the ways M3 accomplishes these goals through its architecture.
Figure 3: Queries return metrics from all regions by proxying requests through the remote regions’ coordinators which read their local regional storage and proxy results back.M3 provides a single global view of all metrics, avoiding the need for upstream consumers to navigate routing, thus increasing the overall simplicity of metrics discoverability. For example, a centralized, all-in-one view enables us to avoid installing multiple Grafana data sources for a single dashboard. Further, for workloads that failover applications between regions or workloads sharded across regions, the single global view makes it much easier to sum and query metrics across all regions in a single query. This lets users see all operations of a specific type globally, and look at a longer retention to view historical trends in a single place. 
To achieve this single pane view without the cost of cross-region replication, metrics are written in M3 to local regional M3DB instances. In this setup, replication is local to a region and can be configured to be isolated by availability zone or rack. Queries fan out to both the local region’s M3DB instances and coordinators in remote regions where metrics are stored, returning compressed M3TSZ blocks for matched time series wherever possible.
In future iterations, we’d like to engineer M3 to push query aggregations to remote regions to execute before returning results, as well as to the local M3DB storage node where possible. While rolling up the metrics before they are stored would be more ideal, not all query patterns are planned ahead of time, and metrics are increasingly being used in an ad hoc manner.
Fewer compactions
We found that Cassandra and other storage systems that compact blocks of data together spend a large fraction of system resources (CPU, memory, and disk IO) just rewriting stored data. Since we treat metrics as immutable, it did not make sense for us to run a system that wastes so many resources on compaction.  M3DB itself only compacts time-based data together when absolutely necessary, such as backfilling data or when it makes sense to combine time window index files together.
This strategy of reducing the need for compaction workloads is pervasive across the platform.  For instance consider downsampling, which, similar to compactions typically requires reading previously written data and computing an aggregate. This requires pulling large volumes of data, which results in a large tax on the network (which can be expensive with cloud providers), CPU (for RPC and for serialization/deserialization), and disk IO. Instead, with M3 we chose to downsample at the time of collection, thereby saving space and resources at ingestion time.  
Storage policies, retention, and downsampling
We don’t think that engineers should need to spend a lot of their time thinking about the lifecycle of their applications metrics. However, this behavior results in a lot of telemetry being collected that might not necessarily need to be stored for a long period of time. As we learned through our experience developing M3, an effective retention strategy keeps all metrics at a few reasonable, distinct retention periods by default and requires users to opt-in the subset of metrics that they want to retain for a longer period of time. 
M3’s metric storage policies define tag (label) matchers to associate retention and downsampling aggregation policies at either fine or coarse-grained levels. M3 lets engineers and data scientists define default rules that apply to all metrics, for instance:
Retention: 2 days, no downsampling
Retention: 30 days, downsampled to a one minute resolution
(And any others…)
For metrics kept at differing retentions or resolutions to the default rules, users can define specific storage policies, either through a UI or definitions under source control to specifically opt-in certain metrics.  As an example, a user could specify the following type of mapping rule, as follows:
  – name: Retain all disk_used metrics 

    filter: name:disk_used* device:sd*

    policies:

      – resolution: 10s

        retention:  2d

      – resolution: 1m

        retention:  30d

      – resolution: 1h

        retention:  5y
Rollups help reduce the latency of specific query access patterns and aggregate metrics across dimensions at ingestion time instead of query time. Prometheus uses recording rules to roll up metrics. Similarly, M3 has rollup rules that can be applied at ingestion time, allowing specific retention storage policies to be set for the resulting rolled up metrics.
 
Getting started with Prometheus and M3
To get started with M3 as a scalable Prometheus store, users need to set up an M3DB cluster to store and query metrics and install the M3 Coordinator sidecar on their Prometheus instances—that’s it! 
To learn more, check out the following documentation guides:
How to setup an M3DB cluster:Experiment with an M3DB single node deployment
Run an M3DB cluster deployment or run M3DB on Kubernetes
How to use M3 as a Prometheus remote storage backend:Integrate with Prometheus using the M3 Coordinator sidecar
The Prometheus sidecar M3 Coordinator writes to local regional M3DB instances and queries fan out to inter-regional M3 Coordinators which coordinate reads from their local regional M3DB instances.  
For metrics rollups, it is perhaps simplest to continue using Prometheus’ recording rules and just leverage M3’s storage policies to select different retentions and downsampling policies as desired for your metrics.
Right now, downsampling and matching metrics retention policies occurs in the M3 Coordinator sidecar, meaning that users cannot sum/aggregate metrics together at ingress across multiple Prometheus instances with M3 rollup rules. Instead, a user can only rollup metrics together from a single Prometheus instance. In the future, we would like to simplify the process of running a regional M3 aggregator cluster to let users create rollups across multiple Prometheus instances and leverage replicated downsampling using leader election on top of etcd.  This can be preferable to a typical Prometheus high availability setup in which two instances both run active-active and consequently write twice the samples to long-term storage (such as M3DB).
 
Help us out
M3DB and M3 Coordinator were built as open source projects from the beginning. Please submit pull requests, issues to the M3 monorepo, and proposals to M3’s proposal repository.  
To improve open source M3, we are actively working on a few things:
Bring M3 Coordinator and M3DB’s new reverse index out of beta
Release m3ctl UI for more simple authoring of retention and aggregation downsampling rules
Add turnkey support for StatsD and Graphite in open source M3
Solicit feedback and contributions from the community
We hope you’ll try out M3DB for your Prometheus-based metrics infrastructure and give us your feedback!
Gitter
Google Mailing List
GitHub Issues
If you’re interested tackling infrastructure challenges at scale, consider applying for a role on our team. Be sure to visit Uber’s official open source page for more information about M3 and other projects.
Subscribe to our newsletter to keep up with the latest innovations from Uber Engineering.