Uber Infrastructure in 2019: Improving Reliability, Driving Customer Satisfaction

Every day around the world, millions of trips take place across the Uber network, giving users more reliable transportation through ridesharing, bikes, and scooters, drivers and truckers additional opportunities to earn, employees and employers more convenient business travel, and hungry eaters quick and easy food delivery. 

Uber Infrastructure, an engineering ecosystem consisting of hundreds of thousands of machines supported by an organization of hundreds of engineers, powers these connections and the software requests that fulfill them in real time. With multiple products used by millions of drivers, riders, eaters, businesses, and vehicles, infrastructure is the manifestation of the very scale and clock speed of Uber. Since our infrastructure undergirds every single request that Uber handles, the scale of Uber Infrastructure’s challenges matches the size of Uber itself. Similarly, the pace of software development at Uber is a function of how effectively our infrastructure exposes the foundational tools and resources that engineers use to build and power customer-facing features, from estimated time of arrival (ETA) prediction on Uber Eats to the seamless matching of carriers and shipments on Uber Freight.

Given these demands, Uber Infrastructure is continually developing new solutions to these challenges of scale and speed, enabling our platform to truly become the operating system of our customers’ lives.

In 2019, Uber was focused on enhancing our infrastructure in four main areas, representing the broad and multifaceted nature of our domain:

    • Building out infrastructure to map to the unique needs of our ever-growing suite of products 
    • Building new solutions at scale to improve the resilience and reliability of the Uber platform  
    • Increasing Uber’s clock speed of software development by creating phenomenal interfaces and tooling for engineers 
    • Creating infrastructure enhancements that enable resource savings and efficiency gains

Read on to learn more about these initiatives: 

Building for product

The Uber platform processes millions of complex financial transactions per day, which demands infrastructure that can support the reliable, inviolable, and regulation-compliant storage of transaction data. To address that challenge, our Integrated Storage Systems team built a storage gateway that can sit atop any number of storage engines and provide ledger-style functionality such as the signing and sealing of data, data encryption, strongly consistent indexes, and data offloading to cold storage. As a result of these capabilities, we can not only validate data integrity, but also unlock the potential for further innovation in how we handle transactions for our customers.
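To illustrate what "signing and sealing" can mean for ledger-style data, here is a minimal sketch of a hash-chained ledger in which each entry is signed together with the seal of the entry before it. This is a generic technique, not a description of how the storage gateway is implemented; the names `SECRET_KEY`, `seal`, `append`, and `verify` are hypothetical and exist only for this example.

```python
import hashlib
import hmac
import json

# Hypothetical signing key, for illustration only.
SECRET_KEY = b"demo-key-not-for-production"


def seal(prev_seal: str, payload: dict) -> dict:
    """Sign a payload together with the previous entry's seal."""
    # Canonical JSON so the same payload always yields the same bytes.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    message = (prev_seal + body).encode("utf-8")
    signature = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return {"payload": payload, "prev": prev_seal, "seal": signature}


def append(ledger: list, payload: dict) -> None:
    """Append a sealed entry that is chained to the current tail."""
    prev = ledger[-1]["seal"] if ledger else ""
    ledger.append(seal(prev, payload))


def verify(ledger: list) -> bool:
    """Recompute every seal in order; any edited entry breaks the chain."""
    prev = ""
    for entry in ledger:
        expected = seal(prev, entry["payload"])["seal"]
        if entry["prev"] != prev:
            return False
        if not hmac.compare_digest(expected, entry["seal"]):
            return False
        prev = entry["seal"]
    return True
```

Because each seal covers the previous seal, changing any historical entry invalidates that entry and every later one, which is what makes integrity checks over the whole ledger possible.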

To handle both the growing complexity and diversity of the search needs of our various products (searching everything from points on the globe to attributes of cuisine) and the massive scale required to index and search across all of our data, our Search Infrastructure team built a unified search platform that provides high-performance search-as-a-service and indexing-as-a-service for many of our products at Uber scale. These innovations have had an observable impact on our measurements of the customer experience. After moving to this platform, Uber Eats saw search latencies fall by 30 percent, stale data drop from 20-30 percent to 0 percent, and index size shrink by 75 percent. And this was just the first version; our ability to provide accurate and rapid results to hungry customers searching for restaurants and dishes will only improve in the future.

Figure 1. Deployment of our unified search platform led to an observable improvement in search latency on Uber Eats.

Resilience and reliability

Much of our work in 2019 focused on building solutions to improve the resilience and reliability of our platform under various conditions. To this end, our Operational Storage team built a next-generation deployment and management platform for all of our stateful technologies, allowing automated and intelligent management of any storage cluster. This technology not only allows our engineers to quickly and easily manage their storage solutions in a unified manner, but also ensures high availability and reliability of Uber’s storage layer through automation and auto-remediation, as well as state-driven scheduling and deployment of storage nodes.
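State-driven management is commonly built around a reconciliation step that compares the desired state of a cluster with what is actually observed and then emits the actions needed to close the gap. The sketch below shows that general pattern only; it is an assumption about the style of system described, not the platform's actual design, and the function `reconcile`, the action names, and the node and version labels are all hypothetical.

```python
def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions that move the observed state toward the desired one.

    Both arguments map a node name to a version label (hypothetical schema).
    """
    actions = []
    for node, want in sorted(desired.items()):
        have = observed.get(node)
        if have is None:
            # The node should exist but is missing: create it.
            actions.append(("provision", node))
        elif have != want:
            # The node exists but has drifted from its target: remediate it.
            actions.append(("repair", node))
    for node in sorted(set(observed) - set(desired)):
        # The node is running but no longer wanted: remove it.
        actions.append(("decommission", node))
    return actions


desired = {"db-1": "v2", "db-2": "v2", "db-3": "v2"}
observed = {"db-1": "v2", "db-2": "v1", "db-4": "v1"}
for action in reconcile(desired, observed):
    print(action)
```

Running a step like this repeatedly is one way automation can remediate failures without an engineer handling each node by hand.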

Developer velocity

To improve both the productivity of our developers and the operational aspects of managing our infrastructure, we migrated hundreds of our Java-based services from their individual and independent codebases to a single, unified monorepo. As a result, we were able to leverage the benefits of using a monorepo, such as the ability to more easily discover and extend code, utilize common and integrated tooling, better standardize frameworks and libraries, and more easily manage, automate, and improve our overall infrastructure. As an example of the benefits of this unification: a single config change, related to Java IPC handling, resulted in a 10 percent improvement in the performance of all Java services across Uber.

Efficiencies and resource savings

Handling the millions of requests we receive requires a massive amount of CPU resources. As a result, even the most modest gains can have a significant impact at our scale, which is why we are always looking for better ways to manage our resource consumption. To ensure that our thousands of software services are allocated the correct amount of CPU resources with high precision, we built a compute scaler for our compute scheduling platform. Utilizing historical CPU usage metrics and mathematical and statistical analysis, this scaler continuously determines the best allocation of CPU resources for any service, taking both safety and efficiency into consideration. This is one of the most effective ways for us to reduce waste, improve our ability to serve requests at scale, and ensure our ability to continue providing a robust platform for our customers. Since its introduction, the scaler has already “recovered” tens of thousands of cores, and the recommendations it provides for CPU right-sizing have proven critical in preventing service degradations and outages.
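As a rough illustration of right-sizing from historical usage, the sketch below recommends an allocation from a high percentile of past CPU samples plus a safety margin. The scaler's real analysis is not described here, so the percentile-plus-headroom rule, the function names, and the default values (99th percentile, 20 percent headroom, 500-millicore floor) are assumptions chosen only for the example.

```python
def percentile(samples: list, pct: int) -> int:
    """Nearest-rank percentile over integer samples."""
    if not samples:
        raise ValueError("at least one usage sample is required")
    ordered = sorted(samples)
    # Ceiling division keeps the rank arithmetic in integers.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def recommend_millicores(usage: list, pct: int = 99,
                         headroom_pct: int = 20, floor: int = 500) -> int:
    """Suggest a CPU allocation in millicores from historical usage.

    Efficiency: follow observed usage instead of a static guess.
    Safety: add headroom on top and never go below a minimum floor.
    """
    peak = percentile(usage, pct)
    target = -(-peak * (100 + headroom_pct) // 100)  # round up
    return max(floor, target)


# Hypothetical history: mostly 2 cores, with one short spike to 10 cores.
history = [2000] * 99 + [10000]
allocated = 8000
recommended = recommend_millicores(history)
print(recommended, allocated - recommended)
```

In this made-up case the 99th percentile ignores the single spike, so the sketch suggests 2,400 millicores and frees 5,600 of the 8,000 originally allocated. Raising `pct` or `headroom_pct` trades some of that saving for more safety.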

Moving forward

As we move into 2020, we will continue to focus on these key areas and keep innovating to match the challenges presented by our products and our scale. As the Uber platform grows, we look forward to further building and strengthening our infrastructure to support our next generation of products and to becoming an even better operating system of our customers’ lives.
