Pedestrian density map

At Uber ATG, developing a safe self-driving car system means training it not only on the typical traffic scenarios we see every day, but also on the edge cases, those more difficult and rare situations that would flummox even a human driver. To make sure we aren’t training our system exclusively on the most common traffic scenarios, we developed a search engine that lets our engineers look for those rarer instances among terabytes of historical sensor data.

An important part of developing self-driving cars is having humans drive sensor-equipped vehicles along city streets, using radar, LiDAR, cameras, and other sensors to collect data. The data gathered by these human-driven cars shows not only what road infrastructure looks like, but also the complex interactions of vehicles, pedestrians, and other actors. The traffic scenarios we derive from this data range from the typical, such as a crowd of pedestrians crossing the street, to more difficult edge cases, like cars caught in an intersection after the light changes.

We use these traffic scenarios to develop machine learning models that help our self-driving cars react safely to common, and not so common, scenarios that come up in a given operational domain.

Leveraging projects such as Apache Hive, Apache Spark, and Apache Hadoop, which are used in Uber’s Big Data platform, we created infrastructure that lets us search our data for these less common scenarios. With this system, ATG developers can query the dataset for difficult cases and refine our models by training on them.

A solid data foundation

One of the keys to quickly iterating on training data for machine learning is a robust and scalable data solution that can run complex queries efficiently. To that end, we developed the ATG Analytics Platform, which holds all of our labeled data in modeled tables that we can query directly.

This data describes the actors in traffic scenarios, including bicyclists, pedestrians, and different vehicle types, so a query could ask for scenarios where bicyclists are present. The data also covers specific types of traffic movement and road geometries, so we can, for example, focus on situations involving left turns, which allows even greater query granularity. The scenarios a query returns can then be used to train our self-driving cars to safely navigate, say, a traffic situation with bicyclists.
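As a rough illustration of the kind of query described above, the following sketch looks up scenarios that contain a bicyclist during a left turn. It is not the actual ATG schema or tooling: an in-memory SQLite database stands in for the Hive warehouse, and the `scenarios` and `scenario_actors` tables, their columns, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the Hive warehouse. Table names, column
# names, and rows are hypothetical, chosen only to illustrate the query shape.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scenarios (
    scenario_id TEXT PRIMARY KEY,
    maneuver TEXT,
    road_geometry TEXT
);
CREATE TABLE scenario_actors (scenario_id TEXT, actor_type TEXT);
""")
conn.executemany("INSERT INTO scenarios VALUES (?, ?, ?)", [
    ("s1", "left_turn", "four_way_intersection"),
    ("s2", "straight", "two_lane_road"),
    ("s3", "left_turn", "t_junction"),
])
conn.executemany("INSERT INTO scenario_actors VALUES (?, ?)", [
    ("s1", "bicyclist"), ("s1", "pedestrian"),
    ("s2", "bicyclist"),
    ("s3", "vehicle"),
])


def find_scenarios(conn, actor_type, maneuver):
    """Return (scenario_id, road_geometry) for scenarios that contain the
    given actor type while the vehicle performs the given maneuver."""
    cursor = conn.execute("""
        SELECT DISTINCT s.scenario_id, s.road_geometry
        FROM scenarios s
        JOIN scenario_actors a ON a.scenario_id = s.scenario_id
        WHERE a.actor_type = ? AND s.maneuver = ?
        ORDER BY s.scenario_id
    """, (actor_type, maneuver))
    return cursor.fetchall()


matches = find_scenarios(conn, "bicyclist", "left_turn")
print(matches)
```

Only scenario `s1` matches here, since it is the only left turn with a bicyclist present. Adding a filter on `road_geometry` would narrow the result further.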

The ATG Analytics Platform was built with the following principles in mind:

  • Searchable: easy to locate available data and formulate a query against this data
  • Dependable: defined quality criteria and predictable service-level agreements
  • Durable: available for long periods of time to do temporal comparisons and analyze trends
  • Scalable: capable of handling intricate queries against large datasets

The ATG Analytics Platform uses a dedicated Apache Hadoop cluster to provide sufficient storage and compute capacity. We deployed Apache Hive alongside Apache Spark and Apache Hadoop to support Hive-based analytics. We also built a library to efficiently onboard data to the Hadoop Distributed File System (HDFS). This library allows developers to add data publishers anywhere in their workflows. After raw data lands in HDFS, extract, transform, and load (ETL) pipelines promote it into modeled tables.

Modeled tables are crucial in making our data useful for training self-driving cars to operate safely. Instead of writing custom ETL pipelines and convoluted functions, autonomy engineers can query modeled tables in a fast and straightforward way, finding the types of scenarios they need to iterate on training. We adopted the dimensional modeling paradigm as our data modeling methodology, allowing us to strike a balance between storage and computation costs.
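To make the dimensional modeling idea concrete, here is a minimal star-schema sketch: one narrow fact table of labels that references small dimension tables. Everything in it is an assumption made for illustration, including the table names (`fact_label`, `dim_actor_type`, `dim_log`), the columns, and the sample rows, and SQLite is used only as a stand-in for Hive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small and descriptive, each value stored once.
CREATE TABLE dim_actor_type (actor_type_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_log (log_id INTEGER PRIMARY KEY, city TEXT, log_date TEXT);
-- Fact table: one narrow row per labeled actor per frame, with foreign
-- keys into the dimensions instead of repeated descriptive strings.
CREATE TABLE fact_label (
    log_id INTEGER REFERENCES dim_log(log_id),
    frame_index INTEGER,
    actor_type_id INTEGER REFERENCES dim_actor_type(actor_type_id),
    x REAL, y REAL, z REAL
);
""")
# Made-up sample data.
conn.executemany("INSERT INTO dim_actor_type VALUES (?, ?)",
                 [(1, "pedestrian"), (2, "bicyclist"), (3, "vehicle")])
conn.executemany("INSERT INTO dim_log VALUES (?, ?, ?)",
                 [(10, "city_a", "2019-01-01"), (11, "city_b", "2019-01-02")])
conn.executemany("INSERT INTO fact_label VALUES (?, ?, ?, ?, ?, ?)", [
    (10, 0, 1, 1.0, 2.0, 0.0),
    (10, 0, 2, 4.0, 1.0, 0.0),
    (10, 1, 1, 1.2, 2.1, 0.0),
    (11, 0, 3, 9.0, 3.0, 0.0),
    (11, 0, 1, 2.0, 5.0, 0.0),
])

# A typical analytical query: join the fact table to its dimensions and
# aggregate. Here: number of labels per city and actor type.
rows = conn.execute("""
    SELECT l.city, a.name, COUNT(*) AS label_count
    FROM fact_label f
    JOIN dim_log l ON l.log_id = f.log_id
    JOIN dim_actor_type a ON a.actor_type_id = f.actor_type_id
    GROUP BY l.city, a.name
    ORDER BY l.city, a.name
""").fetchall()
print(rows)
```

The trade-off is visible even at this scale: descriptive attributes such as city live once in a dimension table, which saves storage in the large fact table, at the cost of a join at query time.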

Having trustworthy data is the first step to successful modeling. The second is having a set of battle-tested analytical tools available in our daily workflows. These tools provide web and programmatic interfaces that allow engineers and data scientists to write SQL queries and Spark jobs against our data warehouse.

For example, when processing ground-truth-labeled data, a pipeline scans every log frame by frame, extracting data at a frequency of 10 frames per second. Each frame includes fields like the 3D coordinates of all the labels in the scene, the label identifier, and the car’s latitude and longitude. A data producer can incorporate these fields and then partition and publish them to HDFS using our internal Python library. After the data lands in HDFS, an ETL job promotes the data to a modeled table. This job enables us to build a Hive metastore, which allows users to query the data as a Hive table using a language like SQL.
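The producer side of that flow might look roughly like the sketch below. The internal Python library is not public, so none of this reflects its API: `sample_frames`, `to_records`, and `publish` are hypothetical helpers, the field names are invented, a local temporary directory stands in for HDFS, and partitioning by `log_id` is an assumption. The sketch downsamples frames to 10 per second, flattens each frame into one record per label, and writes the records under Hive-style `key=value` partition directories.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

TARGET_HZ = 10  # extraction rate stated in the text


def sample_frames(frames, target_hz=TARGET_HZ):
    """Keep the first frame in each 1/target_hz-second window, in time order."""
    window_ms = 1000 // target_hz
    kept, last_window = [], None
    for frame in sorted(frames, key=lambda f: f["timestamp_ms"]):
        window = frame["timestamp_ms"] // window_ms
        if window != last_window:
            kept.append(frame)
            last_window = window
    return kept


def to_records(log_id, frames):
    """Flatten frames into one record per label, carrying the car's position."""
    return [
        {
            "log_id": log_id,
            "timestamp_ms": frame["timestamp_ms"],
            "label_id": label["label_id"],
            "x": label["xyz"][0],
            "y": label["xyz"][1],
            "z": label["xyz"][2],
            "vehicle_lat": frame["lat"],
            "vehicle_lng": frame["lng"],
        }
        for frame in frames
        for label in frame["labels"]
    ]


def publish(records, root):
    """Write records as JSON lines under log_id=<id>/ partition directories."""
    by_partition = defaultdict(list)
    for record in records:
        by_partition[record["log_id"]].append(record)
    paths = []
    for log_id, partition_records in sorted(by_partition.items()):
        part_dir = Path(root) / f"log_id={log_id}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-00000.jsonl"
        path.write_text("".join(json.dumps(r) + "\n" for r in partition_records))
        paths.append(path)
    return paths


# Made-up input: frames captured roughly every 33 ms, one label each.
raw_frames = [
    {"timestamp_ms": t, "lat": 10.0, "lng": 20.0,
     "labels": [{"label_id": "ped-1", "xyz": (1.0, 2.0, 0.0)}]}
    for t in (0, 33, 66, 100, 133, 166, 200)
]
sampled = sample_frames(raw_frames)
records = to_records("log_a", sampled)
out_root = tempfile.mkdtemp()  # local directory standing in for HDFS
written = publish(records, out_root)
print([p.relative_to(out_root).as_posix() for p in written])
```

Directory names of the form `log_id=log_a` follow the layout Hive uses for partitioned tables, which is what would let a later ETL step expose the files as table partitions.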

Uber ATG also has an internal tool called QueryBuilder that provides all the functionality of a relational database frontend, along with state-of-the-art visualization tools that help users understand the story behind the data.


Figure 1: Our internal QueryBuilder tool not only provides a means to access our datasets, but also offers visualizations of the data, such as showing pedestrian density.


This data and associated visualizations allow engineers to identify and understand how our perception model performs geospatially. Being able to run queries of this nature and visualize them is extremely useful for iterating on autonomy models because it helps us gain more insight into the data and find more underrepresented scenes that we need to train the model on.

Training our model on these less common scenes ensures our self-driving cars will react safely if they encounter a similar situation while out on the road.

Figure 2: Using a visualization of data collected by our cars, we can see that pedestrian density is higher at intersections and building exits, information that can be used by our machine learning models.
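One simple way to compute a density like the one shown is to bin pedestrian positions into a latitude/longitude grid and count the points per cell. The sketch below does that under stated assumptions: it is not how QueryBuilder computes its maps, the cell size `CELL_DEG` and the helper names are hypothetical, and the coordinates are made up.

```python
import math
from collections import Counter

# Hypothetical grid resolution: 0.001 degrees is roughly 111 m of latitude.
CELL_DEG = 0.001


def cell_of(lat, lng, cell_deg=CELL_DEG):
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (math.floor(lat / cell_deg), math.floor(lng / cell_deg))


def pedestrian_density(points, cell_deg=CELL_DEG):
    """Count pedestrian observations per grid cell."""
    return Counter(cell_of(lat, lng, cell_deg) for lat, lng in points)


# Made-up pedestrian positions: three close together, one farther away.
pedestrian_points = [
    (10.00061, 20.00082),
    (10.00065, 20.00085),
    (10.00069, 20.00088),
    (10.00251, 20.00382),
]
density = pedestrian_density(pedestrian_points)

# Dense cells first; cells with few observations appear at the end.
for cell, count in density.most_common():
    print(cell, count)
```

Sorting the same counts in ascending order surfaces sparsely observed cells, which is one way to spot locations that are underrepresented in the data.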



Data is the fuel ushering Uber ATG into the self-driving future. The ability to query data that replicates traffic scenarios ranging from the everyday to the very rare will help prepare our self-driving cars for any situation. Data accessibility and proper tooling, as demonstrated by the ATG Analytics Platform, enable autonomy engineers and scientists to unlock the potential behind the significant volume of data that we have amassed at ATG. Better intelligence around our self-driving car logs gives Uber a strategic edge as we efficiently iterate on world-class machine learning models. 

There is no shortage of work to be done in making the future of self-driving cars a reality. If you are interested in riding along with us during this exciting time, join us!