Skip to footer

Cost Efficiency @ Scale in Big Data File Format

  Background Our Apache Hadoop® based data platform ingests hundreds of petabytes of analytical data with minimum latency and stores it in a data lake built on top of the Hadoop Distributed File System (HDFS). We use Apache Hudi™ as our ingestion table format and Apache Parquet™ as the underlying file format. Our data platform leverages Apache Hive™, Apache Presto™, and Apache Spark™ for...

Capacity Recommendation Engine: Throughput and Utilization Based Predictive Scaling

Introduction Capacity is a key component of reliability. Uber's services require enough resources in order to handle daily peak traffic and to support our different kinds of business units. These services are deployed across different cloud platforms and data centers (“zones”). With manual capacity management, it often results in an over-provisioned capacity, which is insufficient for resource usage. Uber built...

The New Version of Orbit (v1.1) is Released: The Improvements, Design Changes, and Exciting Collaborations

Introduction The previous post gave an overview of Orbit, a Python package developed by Uber in order to perform Bayesian time-series analysis and forecasting. This post provides the details of the version 1.1 updates—in particular, changes in syntax of calling models, the new classes design, and the KTR (Kernel Time-varying Regression) model. Some news about external interests and additional use...

How We Saved 70K Cores Across 30 Mission-Critical Services (Large-Scale, Semi-Automated Go GC Tuning @Uber)

Introduction As part of Uber engineering’s wide efforts to reach profitability, recently our team was focused on reducing cost of compute capacity by improving efficiency. Some of the most impactful work was around GOGC optimization. In this blog we want to share our experience with a highly effective, low-risk, large-scale, semi-automated Go GC tuning mechanism. Uber’s tech stack is composed of...

Cadence Multi-Tenant Task Processing

Introduction Cadence is a multi-tenant orchestration framework that helps developers at Uber to write fault-tolerant, long-running applications, also known as workflows. It scales horizontally to handle millions of concurrent executions from various customers. It is currently used by hundreds of different teams within Uber to support their business logic. As we onboard different use cases to Cadence, resource isolation became...

CRISP: Critical Path Analysis for Microservice Architectures

Uber’s backend is an exemplar of microservice architecture. Each microservice is a small, individually deployable program performing a specific business logic (operation). The microservice architecture is a type of distributed computing system, which is suitable for independent deployments and scaling of software programs, and so is widely used across modern service-oriented industries. Uber has a few thousand microservices interacting...

How Uber Migrated Financial Data from DynamoDB to Docstore

Introduction Each day, Uber moves millions of people around the world and delivers tens of millions of food and grocery orders. This generates a large number of financial transactions that need to be stored with provable completeness, consistency, and compliance.   LedgerStore is an immutable, ledger-style database storing business transactions. LedgerStore provides signing/sealing of data to guarantee data completeness/correctness, strongly consistent indexes,...

Introducing uGroup: Uber’s Consumer Management Framework

Background Apache Kafka® is widely used across Uber’s multiple business lines. Take the example of an Uber ride: When a user opens up the Uber app, demand and supply data are aggregated in Kafka queues to serve fare calculations. When a ride request is accepted by a driver, push notifications in Kafka queue are sent to mobile devices. After a ride...

Improving HDFS I/O Utilization for Efficiency

Scaling our data infrastructure with lower hardware costs while maintaining high performance and service reliability has been no easy feat. To accommodate the exponential growth in both Data Storage and Analytics Compute at Uber, the Data Infrastructure team massively overhauled its approach in scaling the Apache Hadoop® Data File System (HDFS) by re-architecting the software layer in conjunction with...

Building Uber’s Fulfillment Platform for Planet-Scale using Google Cloud Spanner

  Introduction The Fulfillment Platform is a foundational Uber domain that enables the rapid scaling of new verticals. The platform handles billions of database transactions each day, ranging from user actions (e.g., a driver starting a trip) and system actions (e.g., creating an offer to match a trip with a driver) to periodic location updates (e.g., recalculating eligible products for a...

Real-Time Exactly-Once Ad Event Processing with Apache Flink, Kafka, and Pinot

Uber recently launched a new capability: Ads on UberEats. With this new ability came new challenges that needed to be solved at Uber, such as systems for ad auctions, bidding, attribution, reporting, and more. This article focuses on how we leveraged open source technology to build Uber’s first “near real-time” exactly-once events processing system. We’ll dive into the details...

YAML Generator for Funnel YAML Files: Streamlining the Mobile Data Workflow Process

At Uber, real-time mobile analytics events—generated by button taps, page views, and more—form the backbone of the mobile data workflow process. To process these events, our Mobile Data Platform Team designed and developed the Fontana library, which converts the nearly-one-million-QPS (queries per second) volume of events into easily digestible and useful analytics for Uber engineers. As part of this process,...

Jellyfish: Cost-Effective Data Tiering for Uber’s Largest Storage System

Problem Uber deploys a few storage technologies to store business data based on their application model. One such technology is called Schemaless, which enables the modeling of related entries in one single row of multiple columns, as well as versioning per column. Schemaless has been around for a couple of years, amassing Uber’s data. While Uber is consolidating all the use...

Streaming Real-Time Analytics with Redis, AWS Fargate, and Dash Framework

Introduction Uber’s GSS (Global Scaled Solutions) team runs scaled programs for diverse products and businesses, including but not limited to Eats, Rides, and Freight. The team transforms Uber’s ideas into agile, global solutions by designing and implementing scalable solutions. One of the areas of expertise within GSS is the Digitization vertical. The Digitization team efficiently converts physical signals into digital...

Enabling Seamless Kafka Async Queuing with Consumer Proxy

Uber has one of the largest deployments of Apache Kafka in the world, processing trillions of messages and multiple petabytes of data per day. As Figure 1 shows, today we position Apache Kafka as a cornerstone of our technology stack. It empowers a large number of different workflows, including pub-sub message buses for passing event data from the rider...

How Data Shapes the Uber Rider App

Introduction Data is crucial for our products. Data analytics help us provide a frictionless experience to the people that use our services. It also enables our engineers, product managers, data analysts, and data scientists to make informed decisions. The impact of data analysis can be seen in every screen of our app: what is displayed on the home screen, the...

Building Scalable Streaming Pipelines for Near Real-Time Features

Background Uber is committed to providing reliable services to customers across our global markets. To achieve this, we heavily rely on machine learning (ML) to make informed decisions like forecasting and surge. As a result, real-time streaming pipelines, which are used to generate the data and features for ML, have become more popular and important. At Uber, we leverage Apache Flink...

Eats Safety Team On-Call Overview

Introduction Our engineers have the responsibility of ensuring a consistent and positive experience for our riders, drivers, eaters, and delivery/restaurant partners. Ensuring such an experience requires reliable systems: our apps have to work when anyone needs them. A major component of reliability is having engineers on call to deal with problems immediately as they arise. We set up our on-call engineers...

Unifying Support Content to Enable More Empathetic and Personalized Customer Support Experiences

Introduction  Content quality is critical to the support experienced by Uber’s customers. Consider an Eater who reached out for help to cancel a very delayed order. The same resolution, such as refunding the charge, can be delivered alongside a robotic-sounding message, or one where the style and tone of the response conveys true empathy and acknowledges the user’s poor experience...

Efficiently Managing the Supply and Demand on Uber’s Big Data Platform

With Uber’s business growth and the fast adoption of big data and AI, Big Data scaled to become our most costly infrastructure platform. To reduce operational expenses, we developed a holistic framework with 3 pillars: platform efficiency, supply, and demand (using supply to describe the hardware resources that are made available to run big data storage and compute workload,...

Popular Articles