At Uber, magical customer experiences depend on accurate arrival time predictions (ETAs). We use ETAs to calculate fares, estimate pickup times, match riders to drivers, plan deliveries, and more. Traditional routing engines compute ETAs by dividing up the road network into small road segments represented by weighted edges in a graph. They use shortest-path algorithms to find the best path through the graph and add up the weights to derive an ETA. But as we all know, the map is not the terrain: a road graph is just a model, and it can’t perfectly capture conditions on the ground. Moreover, we may not know which route a particular rider and driver will choose to their destination. By training machine learning (ML) models on top of the road graph prediction using historical data in combination with real-time signals, we can refine ETAs that better predict real-world outcomes.

For several years, Uber used gradient-boosted decision tree ensembles to refine ETA predictions. The ETA model and its training dataset grew steadily larger with each release. To keep pace with this growth, Uber’s Apache Spark* ^{™}* team contributed upstream improvements [1, 2] to XGBoost to allow the model to grow ever deeper, making it one of the largest and deepest XGBoost ensembles in the world at that time. Eventually, we reached a point where increasing the dataset and model size using XGBoost became untenable. To continue scaling the model and improving accuracy, we decided to explore deep learning because of the relative ease of scaling to large datasets using data-parallel SGD [3]. To justify switching to deep learning we needed to overcome three main challenges:

**Latency:**The model must return an ETA within a few milliseconds at most.**Accuracy:**The mean absolute error (MAE) must improve significantly over the incumbent XGBoost model.**Generality:**The model must provide ETA predictions globally across all of Uber’s lines of business such as mobility and delivery.

To meet these challenges, Uber AI partnered with Uber’s Maps team on a project called DeepETA to develop a low-latency deep neural network architecture for global ETA prediction. In this blog post, we’ll be walking you through some of the learnings and design choices that helped DeepETA become the new production ETA model at Uber.

# Problem Statement

In the past few years, there has been increasing interest in systems that combine physical models of the world with deep learning. We take a similar approach to ETA prediction at Uber. Our physical model is a routing engine that uses map data and real-time traffic measurements to predict an ETA as a sum of segment-wise traversal times along the best path between two points. We then use machine learning to predict the residual between the routing engine ETA and real-world observed outcomes. We call this hybrid approach ETA post-processing, and DeepETA is an example of a post-processing model. From a practical perspective, it’s generally easier to assimilate new data sources and accommodate fast-changing business requirements by updating the post-processing model than it is to refactor the routing engine itself.

To be able to predict the ETA residual, a post-processing ML model takes into account spatial and temporal features, such as the origin, destination and time of the request, as well information about real-time traffic and the nature of the request, such as whether it is a delivery dropoff or rideshare pickup, as illustrated in Figure 1. This post-processing model is the highest QPS (queries per second) model at Uber. This model needs to be fast so as not to add too much latency to an ETA request, and they need to improve ETA accuracy as measured by mean absolute error (MAE) across different segments of the data.

# How We Made It Accurate

The DeepETA team tested and tuned 7 different neural network architectures: MLP [4], NODE [5], TabNet [6], Sparsely Gated Mixture-of-Experts [7], HyperNetworks [8], Transformer [9] and Linear Transformer [10]. We found that an encoder-decoder architecture with self-attention provided the best accuracy. Figure 2 illustrates the high-level architecture of our design. In addition, we tested different feature encodings and found that discretizing and embedding all of the inputs to the model provided a significant lift over alternatives.

## Encoder with Self-Attention

Many people are familiar with the Transformer architecture because of its applications in Natural Language Processing and Computer Vision, but it might not be obvious how Transformers can be applied to tabular data problems like ETA prediction. The defining innovation of Transformer is the self-attention mechanism. Self-attention is a sequence-to-sequence operation that takes in a sequence of vectors and produces a reweighted sequence of vectors. Details can be found in the Transformer paper [9].

In a language model, each vector represents a single word token, but in the case of DeepETA, each vector represents a single feature, such as the origin of the trip or the time of day. Self-attention uncovers pairwise interactions among the K features in a tabular dataset by explicitly computing a K*K attention matrix of pairwise dot products, using the softmax of these scaled dot-products to reweight the features. When the self-attention layer processes each feature, it looks at every other feature in the input for clues and outputs the representation of this feature as a weighted sum of all features. This process is illustrated in Figure 3. Through this way, we can bake the understanding of all the temporal and spatial features into the one feature currently being processed and focus on the features that matter. In contrast to language models, there is no positional encoding in DeepETA since the order of the features doesn’t matter.

Taking a trip from origin A to destination B as an example, the self-attention layer scales the importance of the features given the time of the day, origin and destination, traffic conditions etc. A visualization of the self-attention is shown in Figure 4 (generated from Tensor2Tensor [11]) with 8 colors corresponding to 8 attention heads and shares corresponding to random generated attention weights.

## Feature Encoding

### Continuous and Categorical Features

The DeepETA model embeds all categorical features, and bucketizes all continuous features before embedding them. Somewhat counterintuitively, bucketizing continuous features led to better accuracy than using continuous features directly. While bucketizing isn’t strictly necessary because neural networks can learn any non-linear discontinuous function, a network with bucketized features may have an advantage because it doesn’t have to spend any of its parameter budget learning to partition the input space. Like the Gradient Boosted Decision Tree Neural Network paper [12], we found that using quantile buckets provided better accuracy than equal-width buckets. We suspect that quantile buckets perform well because they maximize entropy: for any fixed number of buckets, quantile buckets convey the most information (in bits) about the original feature value compared to any other bucketing scheme.

### Geospatial Embeddings

Post-processing models receive the origin and destination of a trip as latitudes and longitudes. Because these start and end points are very important for predicting ETAs, DeepETA encodes them differently from other continuous features. Location data is distributed very unevenly over the globe and contains information at multiple spatial resolutions. We therefore quantize locations into multiple resolution grids based on latitude and longitude. As the resolution increases, the number of distinct grid cells grows exponentially and the average amount of data in each grid cell decreases proportionally. We explored three different strategies for mapping these grids to embeddings:

**Exact indexing**, which maps each grid cell to a dedicated embedding. This takes up the most space.**Feature hashing**[16], which maps each grid cell into a compact range of bins using a hash function. The number of bins is much smaller than with exact indexing.**Multiple feature hashing**, which extends feature hashing by mapping each grid cell to multiple compact ranges of bins using independent hash functions. See Figure 5:

Our experiments showed that while feature hashing saved space compared to exact indexing, the accuracy was the same or slightly worse depending on the grid resolution. This is likely due to hash collisions that cause some information to be lost. Multiple feature hashing provided the best accuracy and latency while still saving space compared to exact indexing. This implies that the network is able to combine the information from multiple independent hash buckets to undo the negative effects of single-bucket collisions.

# How We Made It Fast

DeepETA’s serving latency requirements are very stringent. While it’s possible to accelerate inference by using specialized hardware or post-training optimizations, in this section we discuss architectural design decisions that helped DeepETA minimize latency.

## Fast Transformer

While the transformer-based encoder provided the best accuracy, it was too slow to meet the latency requirements for online real-time serving. The original self-attention model has quadratic complexity because it computes a K K attention matrix from K inputs. There have been multiple research works that linearize the self-attention calculation, for example linear transformer [10], linformer [13], performer [14]. After experimentation, we chose the linear transformer, which uses the kernel trick to avoid calculating the attention matrix.

To illustrate the time complexity, let’s use the following example: say we have K inputs with dimension d. The original transformer’s time complexity is O(K^{2}d), while the linear transformer’s time complexity is O(Kd^{2}). If we have 40 features all with 8 dimensions, i.e. K = 40, d = 8, K^{2}d = 12,800, while Kd^{2} = 2,560. It’s clear that the linear transformer is faster whenever K>d.

## More Embeddings, Fewer Layers

Another secret to making DeepETA fast is to utilize feature sparsity. While the model has hundreds of millions of parameters, any one prediction touches only a tiny fraction of them, roughly 0.25%. How did we achieve this?

First of all, the model itself is relatively shallow with just a handful of layers. The vast majority of the parameters exist in embedding lookup tables. By discretizing the inputs and mapping them to embeddings, we avoid evaluating any of the unused embedding table parameters.

Discretizing the inputs gives us a clear speed advantage at serving time compared to alternative implementations. Take the geospatial embeddings pictured in Figure 5 as an example. To map a latitude and longitude to an embedding, DeepETA simply quantizes the coordinates and performs a hash lookup, which takes O(1) time. In comparison, storing embeddings in a tree data structure would require O(log N) lookup time, while using fully-connected layers to learn the same mapping would require O(N^{2}) lookup time. Seen from this perspective, discretizing and embedding inputs is simply an instance of the classic space vs time tradeoff in computer science: by precomputing partial answers in the form of large embedding tables learned during training, we reduce the amount of computation needed at serving time.

# How We Made It General

One of the design goals of DeepETA is to provide a general ETA model that serves all of Uber’s lines of business across the globe. This can be challenging because different lines of business have different needs and different data distributions. The overall model structure is shown in Figure 6 below.

## Bias Adjustment Decoder

Once we have learned meaningful feature representations, the next step is to decode them and make predictions. In our case, the decoder is a fully connected neural network with a segment bias adjustment layer. The distribution of absolute errors varies by a lot across delivery trips vs rides trips, long vs short trips, pick-up vs drop-off trips, and also across global mega-regions. Adding bias adjustment layers to adjust the raw prediction for each of the different segments can account for their natural variations and in turn improve the MAE. This approach performs better than simply adding segment features to the model. The reason we implemented a bias adjustment layer instead of a multi-task decoder is due to latency constraints. We also employ a few tricks to further improve prediction accuracy, such as using ReLU at output to force predicted ETA to be positive; clamping to reduce the effect of extreme values.

## Asymmetric Huber Loss

Different business use cases require different types of ETA point estimates and will also have varying proportions of outliers in their data. For example, we want to estimate a mean ETA for fare computation but need to control for the effect of outliers. Other use cases may call for a specific quantile of the ETA distribution. To accommodate this diversity, DeepETA uses a parameterized loss function, asymmetric Huber loss, which is robust to outliers and can support a range of commonly used point estimates.

Asymmetric Huber loss has two parameters, delta and omega, that control the degree of robustness to outliers and the degree of asymmetry respectively. By varying delta (as shown in Figure 7), squared error and absolute error can be smoothly interpolated, with the latter being less sensitive to outliers. By varying omega, you can control the relative cost of underprediction vs overprediction, which is useful in situations where being a minute late is worse than being a minute early. These parameters not only make it possible to mimic other commonly used regression loss functions, but also make it possible to tailor the point estimate produced by the model to meet diverse business goals.

# How We Train and Serve the Model

We leveraged the Canvas framework from Uber’s ML platform – Michelangelo [15] to train and deploy the model. The exact architecture we prototyped is essentially depicted in Figure 8. Once the model is trained and deployed to Michelangelo, we need to serve those predictions to users for real-time ETA prediction. The high-level architecture is shown in Figure 9 below: Uber consumers’ requests are routed through various services to the uRoute service. The uRoute service serves as a frontend for all routing lookups. It makes requests to the routing engine to produce route-lines and ETAs. It uses this ETA and other model features to make requests to the Michelangelo Online prediction service to get predictions from the DeepETA model. Periodic auto retraining workflows are set up to retrain and validate the model.

**Conclusions and Future Work**

We have launched this DeepETA model into production for global 4-wheel ETA prediction. The DeepETA model launch makes it both possible and efficient to train and serve large-scale Deep Learning models that predict ETAs better than XGBoost approaches. DeepETA delivers an immediate improvement to metrics in production and establishes a model foundation that can be reused for multiple consumer use cases.

DeepETA enables the ETA team to explore diverse model architectures and output multiple ETAs that can be tailored for each consumer. More work is underway to further expand the accuracy improvements it can deliver by looking at every aspect of the modeling process covering Dataset, Features, Transformations, Model Architectures, Training Algorithms, Loss Functions, and Tooling/Infrastructure Improvements. In the future, the team will explore enhancements such as continuous, incremental training, which will enable ETAs to be trained on fresher data.

**Acknowledgements **

DeepETA is a product of an Uber AI <> Maps partnership. We’d like to thank the following people who helped make this possible:

- Software Engineers: Michael Albada, Apurva Bhandari, Aditya Bhave, Tanmay Binaykiya, Chongxiao Cao, Sashikanth Chandrasekaran, Eric Chen, Hao Jiang, Vladimir Kuzmin, Michael Mallory, Michael Mui, Sean Po, Rich Porter, Vivek Sah, Raajay Viswanathan, Joseph Wang, Peng Zhang
- Data Scientists: Wenqi Hu, Alvira Swalin
- Research Scientists: Olcay Cirit, Eric Frank, Xinyu Hu, Felipe Petroski Such
- Product Managers: Wataru Ueno
- Technical Program Managers: Samuel Chan, Ramit Hora
- Eng Managers: Bozhena Bidyuk, Himaanshu Gupta, Sally Lee, Xu Ning
- Leadership: Jaikumar Ganesh, Bobby Parikh, Smitha Shyam, Thomas Williams

*Apache Spark ^{™} and Spark^{™} are trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.*

**References**

- Distributed training upstream improvement contributed by Uber’s Spark team.
- Scalability upstream improvement contributed by Uber’s Spark team.
- Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow.
- Multilayer perceptron
- Popov, Sergei, Stanislav Morozov, and Artem Babenko. “Neural oblivious decision ensembles for deep learning on tabular data.” arXiv preprint arXiv:1909.06312 (2019).
- Arık, Sercan O., and Tomas Pfister. “Tabnet: Attentive interpretable tabular learning.” arXiv (2020).
- Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.” arXiv preprint arXiv:1701.06538 (2017).
- Ha, David, Andrew Dai, and Quoc V. Le. “Hypernetworks.” arXiv preprint arXiv:1609.09106 (2016).
- Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” In Advances in neural information processing systems, pp. 5998-6008. 2017.
- Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. “Transformers are rnns: Fast autoregressive transformers with linear attention.” In International Conference on Machine Learning, pp. 5156-5165. PMLR, 2020.
- Vaswani, Ashish, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones et al. “Tensor2tensor for neural machine translation.” arXiv preprint arXiv:1803.07416 (2018).
- Saberian, Mohammad, Pablo Delgado, and Yves Raimond. “Gradient boosted decision tree neural network.” arXiv preprint arXiv:1910.09340 (2019).
- Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. “Linformer: Self-attention with linear complexity.” arXiv preprint arXiv:2006.04768 (2020).
- Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins et al. “Rethinking attention with performers.” arXiv preprint arXiv:2009.14794 (2020).
- Meet Michelangelo: Uber’s Machine Learning Platform
- Weinberger, Kilian, et al. “Feature hashing for large scale multitask learning.” Proceedings of the 26th annual international conference on machine learning. 2009.