Skip to footer

Customer Support Automation Platform at Uber

High Level Overview of the Problem Introduction If you’ve used any online/digital service, chances are that you are familiar with what a typical customer service experience entails: you send a message (usually email aliased) to the company’s support staff, fill out a form, expect some back and forth with a customer service representative (CSR), and hopefully have your issue resolved. This...

Tuning Model Performance

Introduction Uber uses machine learning (ML) models to power critical business decisions. An ML model goes through many experiment iterations before making it to production. During the experimentation phase, data scientists or machine learning engineers explore adding features, tuning parameters, and running offline analysis or backtesting. We enhanced the platform to reduce the human toil and time in this stage,...

Elastic Distributed Training with XGBoost on Ray

Introduction Since we productionized distributed XGBoost on Apache Spark™ at Uber in 2017, XGBoost has powered a wide spectrum of machine learning (ML) use cases at Uber, spanning from optimizing marketplace dynamic pricing policies for Freight, improving times of arrival (ETA) estimation, fraud detection and prevention, to content discovery and recommendation for Uber Eats. However, as Uber has scaled, we have...

Continuous Integration and Deployment for Machine Learning Online Serving and Models

Introduction At Uber, we have witnessed a significant increase in machine learning adoption across various organizations and use-cases over the last few years. Our machine learning models are empowering a better customer experience, helping prevent safety incidents, and ensuring market efficiency, all in real time. The figure above is a high level view of CI/CD for models and service binary. One...

Efficient and Reliable Compute Cluster Management at Scale

Introduction Uber relies on a containerized microservice architecture. Our need for computational resources has grown significantly over the years, as a consequence of business’ growth. It is an important goal now to increase the efficiency of our computing resources. Broadly speaking, the efficiency efforts in compute cluster management involve scheduling more workloads on the same number of machines. This approach...

Handling Flaky Unit Tests in Java

Introduction to Flaky Tests Unit testing forms the bedrock of any Continuous Integration (CI) system. It warns software engineers of bugs in newly-implemented code and regressions in existing code, before it is merged. This ensures increased software reliability. It also improves overall developer productivity, as bugs are caught early in the software development lifecycle. Hence, building a stable and reliable...

The Evolution of Data Science Workbench

In October 2017, we published an article introducing Data Science Workbench (DSW), our custom, all-in-one toolbox for data science, complex geospatial analytics, and exploratory machine learning. It centralizes everything required to perform data preparation, ad-hoc analyses, model prototyping, workflow scheduling, dashboarding, and collaboration in a single-pane, web-based graphical user interface.  In this article, we reflect on the evolution of DSW...

Scaling of Uber’s API gateway

As a recap from the last article, Uber’s API Gateway provides an interface and acts as a single point of access for all of our back-end services to expose features and data to Mobile and 3rd party partners. Two major components for a system like API Gateway are configuration management and runtime. The runtime component is responsible for authenticating,...

Fraud Detection: Using Relational Graph Learning to Detect Collusion

As Uber grew in popularity and scale among legitimate customers, it also attracted the attention of financial criminals in the cyberspace. One type of fraudulent behavior is collusion, a cooperative fraud action among users. For example, users collude by taking fake trips with stolen credit cards resulting in chargeback (a bank-initiated refund for a credit card purchase). In this...

The Architecture of Uber’s API gateway

API gateways are an integral part of microservices architecture in recent years. An API gateway provides a single point of entry for all our apps and provides an interface to access data, logic, or functionality from back-end microservices. It also provides a centralized place to implement many high-level responsibilities, including routing, protocol conversion, rate limiting, load shedding, header enrichment...

Introducing Orbit, An Open Source Package for Time Series Inference and Forecasting

Orbit is a general interface for Bayesian time series modeling. The goal of Orbit development team is to create a tool that is easy to use, flexible, interitible, and high performing (fast computation). Under the hood, Orbit uses the probabilistic programming languages (PPL) including but not limited to Stan and Pyro for posterior approximation (i.e, MCMC sampling, SVI). Below...

pprof++: A Go Profiler with Hardware Performance Monitoring

Motivation for a Better Go Profiler Golang is the lifeblood of thousands of Uber’s back-end services, running on millions of CPU cores. Understanding our CPU bottlenecks is critical, both for reducing service latencies and also for making our compute fleet efficient. The scale at which Uber operates demands in-depth insights into codes and microarchitectural implications. While the built-in Go profiler is...

Optimal Feature Discovery: Better, Leaner Machine Learning Models Through Information Theory

Introduction  Suppose you own a production ML model that already works reasonably well. You know that adding relevant and diverse sources of signal to your model is a sure way to boost performance, but finding new features that actually improve performance can be a slow and tedious process of trial and error.  At the start of your search, you might look...

Automating Merchant Live Monitoring with Real-Time Analytics: Charon

At Uber, live monitoring and automation of Ops is critical to preserve marketplace health, maintain reliability, and gain efficiency in markets. By the virtue of the word “live”, this monitoring needs to show what is happening now, with prompt access to fresh data, and the ability to recommend appropriate actions based on that data. Uber’s data platform provides the...

Freight Pricing with a Controlled Markov Decision Process

Intro Uber Freight was launched in 2017 to revolutionize the business of matching shippers and carriers in the huge and inefficient freight trucking industry (around $800B annual spend in the US). We believe, and have demonstrated, that a technology-first freight broker and marketplace can provide better opportunities to carriers, and superior outcomes to shippers and communities alike.  One of the wasteful...

Flipr: Making Changes Quickly and Safely at Scale

Introduction Uber’s many software systems require a high volume of changes every day. Because of our systems’ size and complexity, it is a significant challenge to implement these changes without unintended consequences, ultimately slowing down developer productivity. Flipr is a big part of Uber’s solution to solving this problem. Flipr is a tool that we created for dynamic configuration management,...

Uber’s Journey Toward Better Data Culture From First Principles

Data powers Uber Uber has revolutionized how the world moves by powering billions of rides and deliveries connecting millions of riders, businesses, restaurants, drivers, and couriers. At the heart of this massive transportation platform is Big Data and Data Science that powers everything that Uber does, such as better pricing and matching, fraud detection, lowering ETAs, and experimentation. Petabytes of...

Navigating to the Technical Program Management and Learning Team

Spread across 4 continents, the Technical Strategy, Program Management, and Learning team is composed of Technical Program Managers (TPMs), Technical Writers, Technical Strategists, and Technical Training Program Managers. Uber TPMs play a critical role in executing high-impact, company-wide initiatives and continuously improving processes to increase the effectiveness of our Product and Engineering organizations. On the Learning side, Program Managers and Technical Writers increase...

Elastic Deep Learning with Horovod on Ray

Introduction In 2017, we introduced Horovod, an open source framework for scaling deep learning training across hundreds of GPUs in parallel.  At the time, most of the deep learning use cases at Uber were related to the research and development of self-driving vehicles, while in Michelangelo, the vast majority of production machine learning models were tree models based on XGBoost. Now...

Applying Machine Learning in Internal Audit with Sparsely Labeled Data

As machine learning continues to evolve, transforming the various industries it touches, it has only begun to inform the world of audit. As a data scientist and former CPA Auditor, I can understand why this is the case. By nature, auditing is a field that focuses on the fine details and investigates any exceptions, while machine learning typically seeks...