Skip to footer

Better Load Balancing: Real-Time Dynamic Subsetting

Overview Subsetting is a common technique used in load balancing for large-scale distributed systems. In this blog post, we will briefly introduce Uber’s current service mesh architecture that has been powering thousands of critical microservices in Uber since 2016. We will then discuss the challenges we faced when trying to scale the number of tasks in the mesh and issues...

Dynamic Data Race Detection in Go Code

Uber has extensively adopted Go as a primary programming language for developing microservices. Our Go monorepo consists of about 50 million lines of code and contains approximately 2,100 unique Go services. Go makes concurrency a first-class citizen; prefixing function calls with the go keyword runs the call asynchronously. These asynchronous function calls in Go are called goroutines. Developers hide...

Presto® on Apache Kafka® At Uber Scale

Uber’s goal is to ignite opportunity by setting the world in motion, and big data is a very important part of that. Presto® and Apache Kafka® play critical roles in Uber’s big data stack. Presto is the de facto standard for query federation that has been used for interactive queries, near-real-time data analysis, and large-scale data analysis. Kafka is...

Securing Kafka® Infrastructure at Uber

Background Uber has one of the largest deployments of Apache Kafka® in the world. It empowers a large number of real-time workflows at Uber, including pub-sub message buses for passing event data from the rider and driver apps, as well as financial transaction events between the backend services. As Kafka forms a critical component of Uber’s core workflows, it is important...

Uber’s Emergency Button and The Technologies Behind It

Safety has long been a top priority at Uber, as Uber’s CEO Dara Khosrowshahi wrote in ‘Raising the Bar on Safety’ in September 2018. In order to #StandForSafety, the team at Uber has rolled out a set of features, such as Safety Center, Trusted Contacts, and the in-app Emergency Button, among others. The first version of Emergency Button was rolled...

Avoiding CPU Throttling in a Containerized Environment

At Uber, all stateful workloads run on a common containerized platform across a large fleet of hosts. Stateful workloads include MySQL®, Apache Cassandra®, ElasticSearch®, Apache Kafka®, Apache HDFS™, Redis™, Docstore, Schemaless, etc., and in many cases these workloads are co-located on the same physical hosts.  With 65,000 physical hosts, 2.4 million cores, and 200,000 containers, increasing utilization to reduce cost...

One Stone, Three Birds: Finer-Grained Encryption @ Apache Parquet™

Overview  Data access restrictions, retention, and encryption at rest are fundamental security controls. This blog explains how we have built and utilized open-sourced Apache Parquet™'s finer-grained encryption feature to support all 3 controls in a unified way. In particular, we will focus on the technical challenges of designing and applying encryption in a secure, reliable, and efficient manner. We will...

Introducing Ballast: An Adaptive Load Test Framework

As Uber's architecture has grown to encompass thousands of interdependent microservices, we need to test our mission-critical components at max load in order to preserve reliability. Accurate load testing allows us to validate if a set of services are working at peak usage and optimal efficiency while retaining reliability. Load testing those services within a short time frame comes with...

Introducing Carbon Feed for Earners: The One-Stop Info Shop

After launching the Driver App in 2018 to over 2 million earners worldwide, we added content and functionality at a rapid pace. Although this really bolstered the platform, allowing for high-density and high-frequency content, and provided drivers and couriers with new ways to earn, it came with significant costs. For earners, it became increasingly difficult to find the information...

DeepETA: How Uber Predicts Arrival Times Using Deep Learning

At Uber, magical customer experiences depend on accurate arrival time predictions (ETAs). We use ETAs to calculate fares, estimate pickup times, match riders to drivers, plan deliveries, and more. Traditional routing engines compute ETAs by dividing up the road network into small road segments represented by weighted edges in a graph. They use shortest-path algorithms to find the best...

Project RADAR: Intelligent Early Fraud Detection System with Humans in the Loop

Introduction Uber is a worldwide marketplace of services, processing thousands of monetary transactions every second. As a marketplace, Uber takes on all of the risks associated with payment processing. Uber partners who use the marketplace to provide services are paid for their work even if Uber was unable to collect the payment. Fraud response is thus a very important operational...

Cost Efficiency @ Scale in Big Data File Format

  Background Our Apache Hadoop® based data platform ingests hundreds of petabytes of analytical data with minimum latency and stores it in a data lake built on top of the Hadoop Distributed File System (HDFS). We use Apache Hudi™ as our ingestion table format and Apache Parquet™ as the underlying file format. Our data platform leverages Apache Hive™, Apache Presto™, and Apache Spark™ for...

Capacity Recommendation Engine: Throughput and Utilization Based Predictive Scaling

Introduction Capacity is a key component of reliability. Uber's services require enough resources in order to handle daily peak traffic and to support our different kinds of business units. These services are deployed across different cloud platforms and data centers (“zones”). With manual capacity management, it often results in an over-provisioned capacity, which is insufficient for resource usage. Uber built...

The New Version of Orbit (v1.1) is Released: The Improvements, Design Changes, and Exciting Collaborations

Introduction The previous post gave an overview of Orbit, a Python package developed by Uber in order to perform Bayesian time-series analysis and forecasting. This post provides the details of the version 1.1 updates—in particular, changes in syntax of calling models, the new classes design, and the KTR (Kernel Time-varying Regression) model. Some news about external interests and additional use...

How We Saved 70K Cores Across 30 Mission-Critical Services (Large-Scale, Semi-Automated Go GC Tuning @Uber)

Introduction As part of Uber engineering’s wide efforts to reach profitability, recently our team was focused on reducing cost of compute capacity by improving efficiency. Some of the most impactful work was around GOGC optimization. In this blog we want to share our experience with a highly effective, low-risk, large-scale, semi-automated Go GC tuning mechanism. Uber’s tech stack is composed of...

Cadence Multi-Tenant Task Processing

Introduction Cadence is a multi-tenant orchestration framework that helps developers at Uber to write fault-tolerant, long-running applications, also known as workflows. It scales horizontally to handle millions of concurrent executions from various customers. It is currently used by hundreds of different teams within Uber to support their business logic. As we onboard different use cases to Cadence, resource isolation became...

CRISP: Critical Path Analysis for Microservice Architectures

Uber’s backend is an exemplar of microservice architecture. Each microservice is a small, individually deployable program performing a specific business logic (operation). The microservice architecture is a type of distributed computing system, which is suitable for independent deployments and scaling of software programs, and so is widely used across modern service-oriented industries. Uber has a few thousand microservices interacting...

How Uber Migrated Financial Data from DynamoDB to Docstore

Introduction Each day, Uber moves millions of people around the world and delivers tens of millions of food and grocery orders. This generates a large number of financial transactions that need to be stored with provable completeness, consistency, and compliance.   LedgerStore is an immutable, ledger-style database storing business transactions. LedgerStore provides signing/sealing of data to guarantee data completeness/correctness, strongly consistent indexes,...

Introducing uGroup: Uber’s Consumer Management Framework

Background Apache Kafka® is widely used across Uber’s multiple business lines. Take the example of an Uber ride: When a user opens up the Uber app, demand and supply data are aggregated in Kafka queues to serve fare calculations. When a ride request is accepted by a driver, push notifications in Kafka queue are sent to mobile devices. After a ride...

Improving HDFS I/O Utilization for Efficiency

Scaling our data infrastructure with lower hardware costs while maintaining high performance and service reliability has been no easy feat. To accommodate the exponential growth in both Data Storage and Analytics Compute at Uber, the Data Infrastructure team massively overhauled its approach in scaling the Apache Hadoop® Data File System (HDFS) by re-architecting the software layer in conjunction with...