Optimizing Observability with Jaeger, M3, and XYS at Uber

When something goes wrong with a piece of code, engineers want to know all the relevant details of the error immediately so they can get right to work remedying the malfunction. 

However, as technology has advanced, measuring system metrics and sending out alerts has only gotten more complicated: many organizations process millions or billions of requests per day, and their systems are distributed, which means that they rely on an intricate web of microservices to operate. This is the case at Uber, where we use over 4,000 microservices to power our multifaceted and often interdependent services. 

Uber’s observability team helps us keep our microservice architecture performant and reliable at scale. Three observability tools we use are Jaeger, an open source distributed tracing system created at Uber; XYS, an internal sampling service; and M3, our open source metrics stack. All three are important to Uber’s observability workflow and help run our services for users worldwide. 

Below, we highlight three videos covering presentations from our observability engineers on these three important tools and what other engineers can learn from them: 

Jaeger: Distributed Tracing at Uber

 

Observability software engineer Bill Westlin presents the fundamentals of Jaeger, Uber’s open source distributed tracing system, at Uber Open Summit 2018 in San Francisco. He describes what motivated his team to build Jaeger and how they did it. Bill notes that conventional metrics show that something is wrong with a system, but they don’t explain why. Logs offer a great deal of information, but with concurrent requests and multiple hosts, it can be nearly impossible to correlate a log with a given glitch, as both are missing context. In contrast, Jaeger’s distributed tracing follows requests from one service to another, composing a narrative of what happened and what went wrong (or right). This makes it much easier to pinpoint causation and resolve the true issue. Next, Bill describes Jaeger’s architecture, including its back-end components, pluggable storage, web UI, asynchronous ingestion, indexing practices, and more. He concludes by describing what’s on the horizon for Jaeger, such as incomplete span support and a data analysis platform.

XYS: Sampling Strategies

 

During an Uber meetup in 2018, Jaeger team engineer Won Jun Jang begins by covering distributed tracing basics and discussing various sampling strategies related to Uber’s distributed tracing system. Jaeger, he explains, saves engineers from going down a rabbit hole of dependencies to solve a single technical issue; its visualized traces and intuitive UI make diagnosing a problem much easier. Engineers can also use these Jaeger traces to assess their services and systems as a whole. However, this becomes trickier when one considers exactly how to sample this information. Won discusses the benefits and pitfalls of rate-limiting, probabilistic, adaptive, and systematic sampling. Next, he demos XYS (Examine Your Service), Uber’s internal tool for using Jaeger trace samples to extrapolate latency, traffic, and dependency data. He concludes by showing how XYS allows users to compare representative samples to understand anomalies in data and resolve systemic issues. 
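To make two of the strategies named above more concrete, here is a minimal Python sketch of probabilistic and rate-limiting sampling. It is an illustration only, not Jaeger's or XYS's actual code: the class names, the `is_sampled` method, and the token-bucket approach used for rate limiting are all assumptions made for this example.

```python
import random
import time


class ProbabilisticSampler:
    """Hypothetical sampler: keeps each trace independently with probability `rate`."""

    def __init__(self, rate, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self._rng = rng or random.Random()

    def is_sampled(self):
        # random() returns a float in [0.0, 1.0), so rate=1.0 keeps everything
        # and rate=0.0 keeps nothing.
        return self._rng.random() < self.rate


class RateLimitingSampler:
    """Hypothetical sampler: keeps at most `max_per_second` traces per second
    on average, using a token bucket (an assumption for this sketch)."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = float(max_per_second)
        self._clock = clock
        self._tokens = self.max_per_second  # start with a full bucket
        self._last = clock()

    def is_sampled(self):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill tokens in proportion to elapsed time, capped at the bucket size.
        self._tokens = min(
            self.max_per_second,
            self._tokens + elapsed * self.max_per_second,
        )
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

The trade-off is visible in the two classes: the probabilistic sampler keeps a fixed share of traffic, so a low-traffic service may yield very few traces, while the rate limiter caps volume but no longer reflects how much traffic a service actually receives.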

M3: Metrics at Uber

 

Engineer Celina Ward discusses M3, Uber’s open source metrics stack, and M3DB, Uber’s open source distributed time series database, at Uber Open Summit Sofia 2019. First, she defines high-dimensionality metrics: measurements tracked over time that are broken down along many different dimensions (in Uber’s case, this might be the route they’re related to, the region they occurred in, their status codes, etc.). She explains that high-dimensionality metrics can be costly and difficult to use, especially at Uber’s scale, since just one emission could lead to 100 million unique time series. However, they are often critical to business and software development. That’s why Celina’s team created M3 and its accompanying database. These resources allow Uber to store and use high-dimensionality metrics without exorbitant expense or effort. She goes on to provide a case study of how these tools work at Uber, discuss their history, describe future goals for further development, and provide advice for other organizations using M3 and M3DB. 
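A rough way to see why a single metric can fan out into so many time series: every distinct combination of tag values is its own series, so the worst-case count is the product of the number of distinct values per tag. The Python sketch below works through that arithmetic. The tag names and counts are invented for illustration and are not figures from the talk; they were simply chosen to land on the same order of magnitude.

```python
from math import prod

# Hypothetical tag dimensions and distinct-value counts (illustrative only).
tag_cardinalities = {
    "route": 500,
    "region": 100,
    "status_code": 20,
    "host": 100,
}


def max_unique_series(cardinalities):
    """Upper bound on unique time series for one metric name:
    the product of the distinct-value counts of its tags."""
    return prod(cardinalities.values())


print(max_unique_series(tag_cardinalities))  # prints 100000000
```

In practice not every combination occurs, so the real count is usually below this bound, but adding even one more tag multiplies the ceiling again.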

Check out our other articles on Jaeger and M3 to learn more about these technologies – and look out for future articles on XYS as this tool advances! 

Interested in furthering the future of observability at Uber? Apply for a role on our team!
