Efficient and Reliable Compute Cluster Management at Scale

Introduction: Uber relies on a containerized microservice architecture. Our need for computational resources has grown significantly over the years as a consequence of the business’s growth. Increasing the efficiency of our computing resources is now an important goal. Broadly speaking, the efficiency efforts in compute cluster management involve scheduling more workloads on the same number of machines. This approach...

Handling Flaky Unit Tests in Java

Introduction to Flaky Tests: Unit testing forms the bedrock of any Continuous Integration (CI) system. It warns software engineers of bugs in newly implemented code and regressions in existing code before it is merged. This ensures increased software reliability. It also improves overall developer productivity, as bugs are caught early in the software development lifecycle. Hence, building a stable and reliable...

The Evolution of Data Science Workbench

In October 2017, we published an article introducing Data Science Workbench (DSW), our custom, all-in-one toolbox for data science, complex geospatial analytics, and exploratory machine learning. It centralizes everything required to perform data preparation, ad-hoc analyses, model prototyping, workflow scheduling, dashboarding, and collaboration in a single-pane, web-based graphical user interface.  In this article, we reflect on the evolution of DSW...

Scaling of Uber’s API gateway

As a recap from the last article, Uber’s API Gateway provides an interface and acts as a single point of access for all of our back-end services to expose features and data to mobile and third-party partners. Two major components of a system like API Gateway are configuration management and runtime. The runtime component is responsible for authenticating,...

Fraud Detection: Using Relational Graph Learning to Detect Collusion

As Uber grew in popularity and scale among legitimate customers, it also attracted the attention of financial criminals in cyberspace. One type of fraudulent behavior is collusion, a cooperative fraud action among users. For example, users collude by taking fake trips with stolen credit cards, resulting in a chargeback (a bank-initiated refund for a credit card purchase). In this...

The Architecture of Uber’s API gateway

API gateways have become an integral part of microservices architecture in recent years. An API gateway provides a single point of entry for all our apps and provides an interface to access data, logic, or functionality from back-end microservices. It also provides a centralized place to implement many high-level responsibilities, including routing, protocol conversion, rate limiting, load shedding, header enrichment...

Introducing Orbit, An Open Source Package for Time Series Inference and Forecasting

Orbit is a general interface for Bayesian time series modeling. The goal of the Orbit development team is to create a tool that is easy to use, flexible, interpretable, and high performing (fast computation). Under the hood, Orbit uses probabilistic programming languages (PPLs), including but not limited to Stan and Pyro, for posterior approximation (i.e., MCMC sampling, SVI). Below...

pprof++: A Go Profiler with Hardware Performance Monitoring

Motivation for a Better Go Profiler: Golang is the lifeblood of thousands of Uber’s back-end services, running on millions of CPU cores. Understanding our CPU bottlenecks is critical, both for reducing service latencies and for making our compute fleet efficient. The scale at which Uber operates demands in-depth insights into code and its microarchitectural implications. While the built-in Go profiler is...

Optimal Feature Discovery: Better, Leaner Machine Learning Models Through Information Theory

Introduction: Suppose you own a production ML model that already works reasonably well. You know that adding relevant and diverse sources of signal to your model is a sure way to boost performance, but finding new features that actually improve performance can be a slow and tedious process of trial and error. At the start of your search, you might look...

Automating Merchant Live Monitoring with Real-Time Analytics: Charon

At Uber, live monitoring and automation of Ops are critical to preserve marketplace health, maintain reliability, and gain efficiency in markets. By virtue of the word “live”, this monitoring needs to show what is happening now, with prompt access to fresh data, and the ability to recommend appropriate actions based on that data. Uber’s data platform provides the...

Freight Pricing with a Controlled Markov Decision Process

Intro: Uber Freight was launched in 2017 to revolutionize the business of matching shippers and carriers in the huge and inefficient freight trucking industry (around $800B annual spend in the US). We believe, and have demonstrated, that a technology-first freight broker and marketplace can provide better opportunities to carriers, and superior outcomes to shippers and communities alike. One of the wasteful...

Flipr: Making Changes Quickly and Safely at Scale

Introduction: Uber’s many software systems require a high volume of changes every day. Because of our systems’ size and complexity, it is a significant challenge to implement these changes without unintended consequences, which ultimately slows down developer productivity. Flipr is a big part of Uber’s solution to this problem. Flipr is a tool that we created for dynamic configuration management,...

Uber’s Journey Toward Better Data Culture From First Principles

Data powers Uber: Uber has revolutionized how the world moves by powering billions of rides and deliveries connecting millions of riders, businesses, restaurants, drivers, and couriers. At the heart of this massive transportation platform are Big Data and Data Science, which power everything that Uber does, such as better pricing and matching, fraud detection, lowering ETAs, and experimentation. Petabytes of...

Navigating to the Technical Program Management and Learning Team

Spread across 4 continents, the Technical Strategy, Program Management, and Learning team is composed of Technical Program Managers (TPMs), Technical Writers, Technical Strategists, and Technical Training Program Managers. Uber TPMs play a critical role in executing high-impact, company-wide initiatives and continuously improving processes to increase the effectiveness of our Product and Engineering organizations. On the Learning side, Program Managers and Technical Writers increase...

Elastic Deep Learning with Horovod on Ray

Introduction: In 2017, we introduced Horovod, an open source framework for scaling deep learning training across hundreds of GPUs in parallel. At the time, most of the deep learning use cases at Uber were related to the research and development of self-driving vehicles, while in Michelangelo, the vast majority of production machine learning models were tree models based on XGBoost. Now...

Applying Machine Learning in Internal Audit with Sparsely Labeled Data

As machine learning continues to evolve, transforming the various industries it touches, it has only begun to inform the world of audit. As a data scientist and former CPA Auditor, I can understand why this is the case. By nature, auditing is a field that focuses on the fine details and investigates any exceptions, while machine learning typically seeks...

How Uber Deals with Large iOS App Size

The App Size Problem: Uber’s iOS mobile apps for Rider, Driver, and Eats are large. The choice of Swift as our primary programming language, our fast-paced development environment and feature additions, layered software and its dependencies, and statically linked platform libraries result in large app binaries. Reducing application size is critical to our customer experience. Moreover, Apple’s app-download-size...

Evolving Schemaless into a Distributed SQL Database

Introduction: In 2016, we published blog posts (I, II) about Schemaless, Uber Engineering’s scalable datastore. We went over the design of Schemaless and explained the reasoning behind developing it. In this post, we are going to talk about the evolution of Schemaless into a general-purpose transactional database called Docstore. Docstore is a general-purpose multi-model database that provides...

Fast and Reliable Schema-Agnostic Log Analytics Platform

At Uber, we provide a centralized, reliable, and interactive logging platform that empowers engineers to work quickly and confidently at scale. The logs are tagged with a rich set of contextual key-value pairs, with which engineers can slice and dice their data to surface abnormal or interesting patterns that can guide product improvement. Right now, the platform is...

Uber’s Real-time Data Intelligence Platform At Scale: Improving Gairos Scalability/Reliability

Background: Real-time data (# of ride requests, # of drivers available, weather, game) enables operations teams to make informed decisions about our services, such as surge pricing, maximum dispatch ETA calculation, and demand/supply forecasting, that improve user experiences on the Uber platform. While batched data can provide powerful insights by identifying medium-term and long-term trends, Uber services can combine streaming data...
