*By Slawek Smyl, Jai Ranganathan, Andrea Pasqua*

Uber’s business depends on accurate forecasting. For instance, we use forecasting to predict the expected supply of drivers and demands of riders in the 600+ cities we operate in, to identify when our systems are having outages, to ensure we always have enough customer obsession agents managing our support systems, and of course, to plan our business expenditures.

Because of that, Uber has invested heavily in building deep expertise in the forecasting realm. Our teams are at the forefront of cutting edge techniques in machine learning, probabilistic programming, and many other methods to ensure the accuracy of our forecasting algorithms.

Recently, one of our leading data scientists, Slawek Smyl, was named the winner of the M4 Competition, the latest edition of the renowned Makridakis (M) Competition, a challenge for which researchers develop ever more accurate time series forecasting models.

Named after the lead organizer, Spyros Makridakis, the M Competition has been one of the most important events in the forecasting community since 1982. Held roughly once-a-decade, the competitions compare the accuracy of different time series forecasting methods, from naive forecasting to advanced new statistical models and machine learning methods. The competitions deal solely with time series forecasting, without any additional regressors—the whole data of a series is just a vector of numbers.

The recurring results of the M Competitions, which occasionally caused considerable anguish among some researchers, is that simple methods do as well or better than more advanced ones. The M4 Competition used a large data set—100,000 time series—and, generally speaking, the results confirmed this hypothesis: pure machine learning and neural network (NN) methods performed worse than standard algorithms like ARIMA or Exponential Smoothing (ES), and still worse against various combinations of these base statistical methods.

However, the winner of the competition, with a solid margin, was Slawek’s hybrid Exponential Smoothing-Recurrent Neural Networks (ES-RNN) method. It mixes hand-coded parts like ES formulas with a black-box recurrent neural network (RNN) forecasting engine. Versions of the ES-RNN model are under development to tackle some of the most challenging problems in time-series forecasting here at Uber, across various use cases.

In this article, Slawek discusses his winning methods and how others can leverage his research for their own time series forecasting needs.

Take it away, Slawek…

### M4 Competition data set

The table below provides some details on the series used during the competition. Contestants were asked to provide the point forecast (expected value) and forecast intervals of 95 percent coverage, so two more forecasts, one each for 97.5 and 2.5 percentiles.

The prediction horizons varied, e.g., six for yearly, 18 for monthly, and 48 for hourly series. The series were of varying sizes and sometimes with a very broad range, for example between 42 and 2,794 for a monthly series. Because the importance of early points in very long time series’ was questionable, and a long series would impose heavy computational costs, I tended to chop them and use no more than the last 20 years of monthly series and 60 years of yearly series. The origin of the series, e.g., finance and macro(economics), were known, and I used them as one-hot encoded features.

### Exponential smoothing

Exponential smoothing is a venerable family of time series forecasting algorithms that were first proposed over 60 years ago with a simple algorithm:

where is the smoothing factor between 0 and 1. The algorithm says that the forecast of a next step is equal to the forecast of the previous step adjusted by part of the previous error.

This simple formula was extended with hidden state variables, a.k.a. unobserved components, like level, trend, and seasonality, and now comprises over 15 methods, implemented, for example, in the *forecast* package of R. Exponential Smoothing methods perform well on business time series, and one of the best known versions is Holt-Winters, with multiplicative seasonality. Its formulas are as follows:

where is a state variable called *,**, *another state variable is a local linear trend, and are multiplicative seasonality coefficients, so they tend to stay around 1. are smoothing coefficients between 0 and 1, fitted by an optimization algorithm. *Level* is as smoothed version of y.

The single step ahead forecast is .

Now, this approach works quite well, but there is no particular reason to assume that the trend component should be linear; it is just that the linear form allows efficient model fitting. Also, since the model is fitted per series, there is no possibility of cross-series learning. Multiple steps ahead, nonlinear forecasting can be achieved via an NN trained on all series, however, they have their own set of requirements and problems when it comes to forecasting which we’ll discuss in detail later on.

### Neural networks for forecasting

Neural Networks (NN) are a very broad family of capable machine learning models, but, like almost all of them, they do not have notion of a time axis. The series data needs to be preprocessed and doing this properly is not straightforward. NNs tend to have too many parameters (weights) to be fitted for each time series. Learning across many time series potentially solves this issue and opens up the possibilities of cross-learning, but it requires even more careful data preparation (especially normalization and often deseasonalization). Even with very good preprocessing and network architectures, the results tend to disappoint: it appears that, trained across many time series, NNs, and even RNNs, average their responses too much. In other words, the outputs are not time series-specific enough.

My M4 Competition model attempted to solve this problem by being hierarchical–part time series-specific and part global. Creating this kind of model was possible thanks to the recent creation of Dynamic Computational Graph neural network systems (DGNNs), like DyNet and Pytorch, and the very recent “eager execution” mode in TensorFlow. Next, we discuss how to approach preprocessing.

#### Preprocessing

Preprocessing is a very important step; making mistakes here can ruin the NN forecasting performance, no matter how sophisticated the NN architecture.

I used constant size input and output windows. The output size was equal to the prediction horizon, and the input size was set up partly by experimentation and partly by a rule of thumb that it should be at least equal to the seasonality, e.g., twelve for a monthly series. For non-seasonal series, e.g. yearly, I made it a bit smaller than the prediction horizon. The input window can be useful when its larger and good features are extracted skillfully, but its size is also limited by the length of the series. In the M4 data set, for instance, there were many short series.

A necessary step of preprocessing is normalization. This can be done across whole time series, but it is better to conduct locally: we move the input and output windows and the time series values within both windows get divided by some reasonably stable value that nonetheless changes with the series, e.g., the median of last seasonality-number of points. In this project I used the *level*.

When dealing with a data set of time series of unknown origins, timestamps, and different seasonality patterns as in the M4 competition, NNs tend to struggle with seasonality. A standard remedy is to deseasonalize the series first using a statistical procedure like the STL (Seasonal and Trend decomposition using Loess). Still, a deseasonalization algorithm, no matter how robust and sophisticated, is designed to fulfill some statistical criteria that are likely to be only partly aligned with a goal of producing good features to NN with the overarching goal of strong forecasting. It is better when deseasonalization is an integral part of the forecasting algorithm, as in the case of ES, as opposed to occurring before the learning process.

### Hybrid model

With these insights, it made sense to merge Holt-Winters and NN models (actually RNNs, as they are better for sequences and time series):

We do not need to account for local linear trend:

and the forecasting formula becomes:

where is a vector of preprocessed data and the multiplication is element-wise. is composed of normalized and deseasonalized time series-derived features, typically seasonality-long; its single, scalar component, is calculated as:

Additionally, one-hot encoded origin of series was added.

It is important to remember that , , and are per-series coefficients, while the RNN is global and trained on all series—it is a hierarchical model. The computational graph is different for each series, as it contains some series-specific parameters, and that’s why we need to leverage a dynamic computational graph NN system.

How is it really done? I doubt you can do it in Keras, as you need a low-level control. I was using DyNet and coded in C++. First, we collect per-series parameters:

auto additionalParams= additionalParams_map[series];

Expression levSm_ex = logistic(parameter(cg, additionalParams.levSm)); //level smoothing

Expression sSm_ex = logistic(parameter(cg, additionalParams.sSm)); //seasonality smoothing

…

Then we use them for calculating *levels* and seasonality coefficients:

for (int i=1; i<m4Obj.vals.size();i++) {

Expression newLevel_ex=m4Obj.vals[i]*cdiv(levSm_ex,season_exVect[i]) + (1-levSm_ex)*levels_exVect[i-1];

levels_exVect.push_back(newLevel_ex);

Expression newSeason_ex=m4Obj.vals[i]*cdiv(sSm_ex,newLevel_ex) + (1-sSm_ex)*season_exVect[i];

season_exVect.push_back(newSeason_ex);

}

…cdiv() is element-wise division, m4Obj.vals contains the time series values, . Running this loop adds the formulas above into the computational graph.

Next, in another loop, we walk through series points and prepare the input as follows:

Expression input0_ex=input(cg,{INPUT_SIZE},input_vect);

Expression input1_ex=cdiv(input0_ex,inputSeasonality_ex); //deseasonalization

input1_ex= cdiv(input1_ex, levels_exVect[i]); //normalization

joinedInput_ex.emplace_back(noise(squash(input1_ex), NOISE_STD)); //normalization+noise

joinedInput_ex.emplace_back(input(cg, { NUM_OF_CATEGORIES }, m4Obj.categories_vect));

Expression input_ex = concatenate(joinedInput_ex);

…

and finally pass it to the RNN stack:

rnn_ex = rNNStack[0].add_input(input_ex);

for (int il=1; il<dilations.size(); il++)

rnn_ex=rnn_ex+rNNStack[il].add_input(rnn_ex); //resNet-style

You might have noticed that the RNN call is not a single call. That’s because the RNN stack was composed of several blocks of dilated long short-term memory units (LSTMs) connected by ResNet-style shortcuts, as described in the following section.

### Neural architectures

Several architectures were deployed, guided mostly by experimentation (backtesting results). At a high level, they were all dilated LSTM-based blocks, sometimes followed by a non-linear layer, and always followed by a linear “adaptor” layer whose main job was to convert from a hidden-layer size to the output size (i.e., prediction horizon).

The LSTM blocks were composed of a number (1-2) of sub-blocks optionally connected with ResNet-style shortcuts. Each sub-block was a sequence of one to four layers belonging to one of the three types of dilated LSTMs: standard (Chang 2017), with attention mechanism (Qin 2017), and a residual version using a special type of shortcut (Kim, El-Khamy, and Lee 2018). Figure 2 shows three examples of configurations for:

**Point forecast of quarterly series**, which consists of two blocks, each of two dilated LSTMS, connected by a single “classical” shortcut around the second block.**Point forecast of monthly****series**, which consists of a single block composed of four dilated LSTMs, with residual connections similar to (Kim, El-Khamy, and Lee 2018). Note that the shortcut arrows point correctly into the inside of the residual LSTM cell–this is a non-standard residual shortcut.**Prediction interval forecast****of yearly series**, which consists of a single block composed of two dilated LSTMs that leverage the attention mechanism, followed by a dense non-linear layer (with tanh() activation), and then by a linear adaptor layer, of the size equal to double of the output size, allowing us to forecast both lower and upper quantiles at the same time.

### Training loss function

The M4 Competition error metric of point forecasts was the average of Symmetric Mean Absolute Percent Error (sMAPE) and Mean Absolute Scaled Error (MASE). Both of them are mean absolute differences between forecasted and actual values, although differently normalized. In my ES-RNN model, input and immediate output values of the RNN stack were also normalized, although again a bit differently (divided by *level*), so I considered the L1 difference as similar enough to sMAPE and MASE in that context, and at first I used it as my training loss function. However, when backtesting, the system tended to have a positive bias. To counter this, I decided to use a pinball loss function that features a non-symmetric penalty (and minimizing on it leads to the quantile regression). It is defined as:

where Q is the quantile, e.g. 0.5, (in which case it is the same as the L1 difference).

To counter the positive bias, I used values a bit lower, like 0.48.

The M4 Competition error metric for prediction intervals coverage was its Mean Scaled Interval Score (MSIS). It is a single metric that scores at the same time upper and lower prediction intervals. Its formula is:

where is the upper interval and is the lower one, the indicator function **1** is 1 when the condition holds, otherwise it is 0, is the significance level, here 0.05, and the denominator, as in MASE, is an average absolute error of the naive forecast in-sample (on known part of series). You can review the M4 Competition Guide for more details on this formula.

I am not describing the denominator in detail because, as in the case of metrics for the point forecast, I did not use it, relying on the fact that my inputs to and immediate outputs from the RNN stack were normalized.

So my loss, when forecasting prediction intervals, was the numerator of the above formula.

### Code and documentation

The M4 Competition github repository contains a number of benchmark methods provided by organizers and the contestants branches. Check out my code and more detailed documentation to learn more about my hybrid model.

### Moving forward

In addition to being at the center of much experimentation and research, Slawek’s ES-RNN model and other similar ones have found multiple applications at Uber. Given the complexity of our problem space, cutting-edge solutions are needed to tackle it effectively.

For instance, one area of intense development is forecasting sparse time series, a natural fit for a hybrid model like ES-RNN, given the ability of machine learning models to transfer knowledge from denser to sparser time series.

Additionally, the M4 Competition featured pure time series without regressors, but in practical applications, additional regressors are available, ranging from weather, event information, geographical attributes, and so on. Incorporating them could enrich the model and improve performance.

And yet a third area of exploration is forecasting time series that require domain knowledge. Hybrid models like ES-RNN offer a way of encoding that knowledge, making the forecast better suited to the real-world applications Uber encounters. We are also planning to include this model into our future Pyro Forecasting Library, and therefore make it available for Python users.

The remarkable performance of ES-RNN in the M4 Competition makes it a major forecasting milestone.

*If you’re interested in learning more how we apply forecasting to improve trip experiences on the Uber platform, apply for a role on our team.*

*Subscribe to our newsletter** to keep up with the latest innovations from Uber Engineering.*