Data / ML, Engineering

Uber’s Big Data Platform: 100+ Petabytes with Minute Latency

October 17, 2018 / Global
Figure 1: Before 2014, the total amount of data stored at Uber was small enough to fit into a few traditional OLTP databases. There was no global view of the data, and data access was fast since each database was queried directly.
Figure 2: The first generation of Uber’s Big Data platform allowed us to aggregate all of Uber’s data in one place and provide a standard SQL interface for users to access data.
Figure 3: The second generation of our Big Data platform leveraged Hadoop to enable horizontal scaling. By incorporating technologies such as Parquet, Spark, and Hive, the platform ingested, stored, and served tens of petabytes of data.
Figure 4: While Hadoop enabled the storage of several petabytes of data in our Big Data platform, the latency for new data was still over one day, a lag due to the snapshot-based ingestion of large, upstream source tables that take several hours to process.
Figure 5: The third generation of our Big Data platform incorporates faster, incremental data ingestion (using our open source framework), as well as more efficient storage and serving of data via our open source library.
Figure 6: A raw table that is being updated through the Hudi writer can be read in two different modes: the latest mode view, which returns the latest value for all records, and the incremental mode view, which returns only the records updated since the last read.
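The two read modes in Figure 6 can be illustrated with a small in-memory model. This is a minimal sketch and not the Hudi API: `ToyTable`, `upsert`, `read_latest`, `read_incremental`, and the integer `commit_time` are hypothetical names chosen for illustration, and the model assumes each write carries a monotonically increasing commit number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    row_key: str
    value: str
    commit_time: int  # hypothetical: increasing commit number of the write


class ToyTable:
    """Hypothetical stand-in for a raw table that receives upserts."""

    def __init__(self):
        self._latest = {}  # row_key -> most recently written Record

    def upsert(self, row_key, value, commit_time):
        # A later write for the same row_key replaces the earlier one.
        self._latest[row_key] = Record(row_key, value, commit_time)

    def read_latest(self):
        # Latest mode: the current value of every record in the table.
        return {key: rec.value for key, rec in self._latest.items()}

    def read_incremental(self, since_commit):
        # Incremental mode: only records written after the reader's checkpoint.
        return {
            key: rec.value
            for key, rec in self._latest.items()
            if rec.commit_time > since_commit
        }


table = ToyTable()
table.upsert("trip-1", "requested", commit_time=1)
table.upsert("trip-2", "requested", commit_time=1)
table.upsert("trip-1", "completed", commit_time=2)

latest_view = table.read_latest()
incremental_view = table.read_incremental(since_commit=1)
```

In this sketch, a reader that last checkpointed at commit 1 sees only `trip-1` in the incremental view, while the latest view still returns both rows.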
Figure 7: Standardizing our Hive data model improved data quality for our entire Big Data ecosystem. This model incorporates a merged snapshot table containing the latest value for each row_key, as well as a changelog history table containing the history of all value changes for each row_key.
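The relationship between the two tables in Figure 7 can be sketched as a fold over the changelog: replaying every change in time order and keeping the last value per row_key yields the merged snapshot. This is an illustrative sketch only; the dictionary fields `ts` and `value` and the function `merge_snapshot` are assumptions, not the schema or code used at Uber.

```python
def merge_snapshot(changelog):
    """Build a merged snapshot (row_key -> latest value) from changelog entries.

    Each entry is a dict with hypothetical fields: row_key, value, ts.
    """
    snapshot = {}
    # Replay changes oldest-first so later values overwrite earlier ones.
    for entry in sorted(changelog, key=lambda e: e["ts"]):
        snapshot[entry["row_key"]] = entry["value"]
    return snapshot


changelog = [
    {"row_key": "user-7", "value": "bronze", "ts": 10},
    {"row_key": "user-9", "value": "silver", "ts": 11},
    {"row_key": "user-7", "value": "gold", "ts": 15},
]

snapshot = merge_snapshot(changelog)
```

The changelog keeps all three entries, so the full history of `user-7` remains queryable, while the snapshot holds one row per row_key.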
Figure 8: Building a more extensible data transfer platform allowed us to easily aggregate all data pipelines in a standard way under one service as well as support any-to-any connectivity between any data source and data sink.
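One way to picture the any-to-any connectivity in Figure 8 is a pair of small interfaces: if every source can emit records in a common shape and every sink can accept that shape, a single transfer routine connects any source to any sink. The sketch below is an assumption-laden illustration; `Source`, `Sink`, `ListSource`, `ListSink`, and `transfer` are hypothetical names and do not describe the actual platform's interfaces.

```python
from typing import Iterable, Iterator, Protocol


class Source(Protocol):
    def read(self) -> Iterator[dict]: ...


class Sink(Protocol):
    def write(self, records: Iterable[dict]) -> int: ...


class ListSource:
    """Hypothetical source backed by an in-memory list of rows."""

    def __init__(self, rows):
        self._rows = list(rows)

    def read(self):
        return iter(self._rows)


class ListSink:
    """Hypothetical sink that collects rows in memory."""

    def __init__(self):
        self.rows = []

    def write(self, records):
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before  # number of rows written


def transfer(source: Source, sink: Sink) -> int:
    # The pipeline depends only on the two interfaces, not on concrete systems.
    return sink.write(source.read())


source = ListSource([{"row_key": "a", "value": 1}, {"row_key": "b", "value": 2}])
sink = ListSink()
written = transfer(source, sink)
```

Under this design, supporting N sources and M sinks takes N + M connector implementations rather than one bespoke pipeline per source-sink pair.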
Reza Shiftehfar currently leads Uber’s Hadoop Platform team. His team helps build and grow Uber’s reliable and scalable Big Data platform that serves petabytes of data utilizing technologies such as Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, and Presto. Reza is one of the founding engineers of Uber’s data team and helped scale Uber's data platform from a few terabytes to over 100 petabytes while reducing data latency from 24+ hours to minutes.

Posted by Reza Shiftehfar