Earlier this year, we introduced Uber’s Customer Obsession Ticket Assistant (COTA) system, a tool that leverages machine learning and natural language processing (NLP) techniques to recommend support ticket responses (Contact Type and Reply) to customer support agents, with Contact Type being the issue category that the ticket is assigned to and Reply the template agents use to respond. After integrating it into our Customer Support Platform, COTA v1 reduced English-language ticket resolution times by over 10 percent while delivering service with similar or higher levels of customer satisfaction.

For Uber, COTA v1 was just the beginning. In an effort to improve COTA performance, we conducted offline experiments which showed that deep learning can increase the solution’s top-1 prediction accuracy by 16 percentage points (from 49 percent to 65 percent) for the Contact Type model and 8 percentage points (from 47 percent to 55 percent) for the Reply model (for more details, please refer to our KDD paper), a true feat given the complexity of these tasks.

Given these encouraging results, we decided to onboard our deep learning models into Uber’s in-house machine learning platform, Michelangelo. We accomplished this by building a Spark-based deep learning Pipeline to productize the second generation of COTA (COTA v2) using the existing infrastructure of Michelangelo. Given that model performance decays over time, we also built a model management pipeline to automatically retrain and retire models to keep them up-to-date at all times.

After integrating with Michelangelo, our online tests validate that the COTA v2 deep learning system performs significantly better than the COTA v1 system in terms of key metrics, including model performance, ticket handling time, and customer satisfaction.

 

First generation of COTA: challenges and opportunities

While COTA v1 expedited support ticket resolution, we identified two primary areas for improvement. First, COTA v1 conducted negative sampling in an overly complex way that made it difficult to train our models. Combined with dependencies on specific data tables, this factor ultimately made retraining COTA a substantial undertaking. Although not insurmountable, this level of difficulty for an ongoing task could, over time, disincentivize regular maintenance.

Second, our original implementation was not extensible enough to be used by future NLP models. We have since made a very conscious effort to develop a deep learning deployment process that opens the door not just for our models, but also for those from all other teams at Uber.

Why deep learning?

The success of COTA v1 motivated us to further invest in our technology stack and explore other support resolution solutions. Serving more than 600 cities worldwide, supporting multiple languages, and facilitating over five communication channels, Uber’s customer support reaches customers across our businesses including ridesharing, Uber Eats, bikesharing, and Uber Freight. The scope and scale of our business added enormous complexity to the challenge we faced. As a result, the number of ways to categorize and solve support tickets is on the order of thousands. In addition, Uber’s growth requires us to iterate at an unprecedented pace. A solution that works today may not work in a few months if we do not take business growth into account.

Deep learning has revolutionized many domains such as machine translation, speech recognition, computer vision, and natural language understanding, and has achieved performance either on par with or superior to human experts for certain tasks. For example, deep learning models have already outperformed humans on some image classification and recognition tasks. For the task of using retina photographs to detect diabetic eye disease, Google has shown that deep learning algorithms perform on par with ophthalmologists. The recent success of AlphaGo demonstrates that deep learning algorithms combined with reinforcement learning can even beat the world’s best human Go players in what is often considered humankind’s most complicated board game.

Based on these examples and many more, deep learning seemed like a natural choice for the development of COTA v2. Indeed, via offline experiments, we found that deep learning models could provide much more accurate ticket resolution predictions when compared to COTA v1.

 

Moving to COTA v2 with deep learning

To briefly summarize, COTA v1 was built with topic modeling-based traditional NLP and machine learning techniques that incorporate a mixture of textual, categorical, and numerical features, as shown in Figure 1(a), below:

Figure 1. (a) The model architecture of COTA v1 leverages topic modeling, traditional feature engineering techniques, and a point-wise ranking algorithm, while (b) COTA v2 supports a deep learning architecture with a mixture of input features.

To extract the textual features, an NLP Pipeline was built to process incoming ticket messages. Topic modeling was used to extract a feature representation from the text. Additional feature engineering was used to generate cosine-similarity features. Once each feature was engineered, all the features were fed into a binary point-wise ranking algorithm to predict the Contact Type and Reply responses.
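As a rough illustration of the cosine-similarity step, the sketch below compares a ticket's topic vector with the topic vectors of two candidate replies. The vectors, the candidate names, and the `cosine_similarity` helper are all hypothetical; the actual COTA v1 feature engineering is not detailed in this article.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up topic distributions for one ticket and two candidate replies.
ticket_topics = [0.7, 0.2, 0.1]
candidates = {
    "reply_fare_adjustment": [0.6, 0.3, 0.1],
    "reply_lost_item": [0.1, 0.1, 0.8],
}

# One similarity feature per (ticket, candidate) pair, as input to a ranker.
similarity_features = {
    name: cosine_similarity(ticket_topics, vector)
    for name, vector in candidates.items()
}
```

In a point-wise ranking setup, each (ticket, candidate) pair would be scored on its own, with a similarity like this as one of several input features.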

Figure 1(b) depicts the deep learning architecture we used for COTA v2. The text feature goes through typical NLP preprocessing such as text cleaning and tokenization (not shown), and each word in the ticket is encoded using an embedding layer (not shown) to convert the word to a dense representation that further runs through convolution layers to encode the entire ticket text. Categorical features are encoded using an embedding layer to capture the closeness between different categories. Numerical features are batch normalized to stabilize the training process. Our offline experiments show that the COTA v2 deep learning system performs systematically better than COTA v1 (an improvement of 8 to 16 percentage points), both for the individual tasks of identifying Contact Type or Reply independently and for the joint task of predicting Contact Type and Reply at once.
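To make the batch normalization step for numerical features concrete, here is a minimal sketch in plain Python. It is not the COTA v2 implementation: it only shows the core standardization over a mini-batch and leaves out the learnable scale and shift parameters and the running statistics that a deep learning framework would maintain. The feature values are made up.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Standardize each feature column of a mini-batch to zero mean and near-unit variance."""
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(num_features)
    ]
    # eps keeps the division stable when a column has (almost) no variance.
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(num_features)]
        for row in batch
    ]


# Hypothetical numerical features per ticket: (trip fare, number of past trips).
batch = [[12.5, 3.0], [48.0, 120.0], [7.25, 15.0], [23.0, 62.0]]
normalized = batch_normalize(batch)
```

Putting features with very different scales onto a comparable range is what helps keep gradient updates stable during training.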

Figure 2. t-SNE plots depict embeddings learned by deep learning models for a) words and b) contact types.

Figure 2, above, shows the t-Distributed Stochastic Neighbor Embedding (t-SNE) plots of the embeddings we learned via deep learning models. For instance, in Figure 2(a), we visualize some Uber-specific keywords and observe that “vehicle” and “car” are very close to each other in the t-SNE plot of the embedding. Words related to payment, such as “charge,” “credit,” and “fare,” are also clustered together in the plot.

Figure 2(b) shows the embeddings learned for the contact types, with each data point corresponding to one unique contact type. The contact types are color-coded into three major groups: “rider,” “driver,” and “other” (e.g., eater, restaurant, etc.). The t-SNE plot shows clear clustering of rider and driver-related contact types. These visualizations intuitively confirm that the model is learning reasonable representations and suggest that it is capable of capturing correlations and semantic connections between words and relationships between contact types.

In short, deep learning can improve the solution’s top-1 prediction accuracy by 16 percentage points (from 49 percent to 65 percent) for the Contact Type model, and 8 percentage points (from 47 percent to 55 percent) for the Reply model compared to COTA v1, which can directly improve the customer support experience.
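For readers less familiar with the metric, the sketch below shows one way to compute top-1 accuracy from ranked predictions, and how the reported Contact Type gain reads as an absolute versus a relative improvement. The toy tickets and label names are invented for illustration; only the 49 percent and 65 percent figures come from the offline experiments described above.

```python
def top1_accuracy(ranked_predictions, true_labels):
    """Fraction of examples whose highest-ranked prediction equals the true label."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if ranked and ranked[0] == truth
    )
    return hits / len(true_labels)


# Toy example: each inner list is a model's ranking of contact types for one ticket.
ranked = [
    ["fare_review", "lost_item"],
    ["lost_item", "fare_review"],
    ["account_access", "fare_review"],
    ["fare_review", "account_access"],
]
truth = ["fare_review", "fare_review", "account_access", "fare_review"]
toy_accuracy = top1_accuracy(ranked, truth)  # 3 of the 4 top predictions are correct

# Reported Contact Type accuracies for COTA v1 and COTA v2.
contact_v1, contact_v2 = 0.49, 0.65
point_gain = (contact_v2 - contact_v1) * 100                   # absolute gain, in percentage points
relative_gain = (contact_v2 - contact_v1) / contact_v1 * 100   # gain relative to the v1 baseline, in percent
```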

Challenges and solutions for deploying COTA v2

Given the strong performance of the deep learning models in our offline analysis, we decided to integrate the COTA v2 system into production. However, given the complexity of integrating both NLP transformations and deep learning training, as well as using a large amount of training data, deploying our COTA v2 deep learning models came with its fair share of challenges.

Ideally, we wanted to leverage Spark for the NLP transformations in a distributed fashion. Spark computations are typically done using CPU clusters. On the other hand, deep learning training runs more efficiently on a GPU-based infrastructure. To address this duality, we needed to figure out a way to use both Spark transformations and GPU training, as well as build a unified Pipeline for training and serving the deep learning model.

Another challenge we dealt with was determining how to maintain model freshness given the dynamic nature of Uber’s business. In light of this, a pipeline was needed to frequently retrain and redeploy models.

To solve the first challenge, we built a deep learning Spark Pipeline (DLSP) to leverage both Spark for NLP transformations and GPUs for deep learning training. For the second challenge, we integrated an internal job scheduling tool and built a model life-cycle management Pipeline (MLMP) on top of the DLSP, allowing us to schedule and run each job at the frequency required. These two pipelines enabled us not only to train and deploy deep learning models into Uber’s production system, but also to retrain and refresh the models to keep them at peak performance.

In the next two sections, we discuss the two pipelines in greater detail.

 

COTA v2’s deep learning Spark Pipeline

In designing our DLSP, we wanted to assign tasks to CPUs and GPUs based on which hardware would be most efficient. Dividing the pipeline into two stages, one for Spark pre-processing and one for deep learning, seemed like the best way of allocating the workload. By extending the concept of a Spark Pipeline, we can serve models for both batch prediction and real-time prediction services using our existing infrastructure.

Training

Model training is split into two stages, as shown in Figure 3(a), below:

  1. Pre-processing transformations using Spark: We leverage our large Spark clusters to perform data pre-processing and fit the transformations required for both training and serving. All the transformations performed on the data during pre-processing are saved as Spark transformers, which are then used to build a Spark Pipeline for serving. The distributed pre-processing in the Spark cluster is much faster than pre-processing data on a single-node GPU machine. We compute both fitted transformations (transformations that require persisting data, e.g., StringIndexer) and non-fitted transformations (e.g., cleaning up HTML tags from strings) in the Spark cluster.
  2. Deep learning training using TensorFlow: Once the pre-processing from step (1) is complete, we leverage the pre-processed data to train the deep learning model using TensorFlow. The trained model from this stage is then merged with the Spark Pipeline generated in step (1). This produces the final Spark Pipeline encompassing the pre-processing transformers and the TensorFlow model, which can be used to run predictions. We are able to combine the Spark Pipeline with the TensorFlow model by implementing a special type of transformer called TFTransformer, which brings the TensorFlow model into Spark. It’s important to note that since all Spark Transformers are backed by Java implementations, the TFTransformer sticks with this pattern.

Figure 3. We built a deep learning Spark Pipeline architecture for a) training models and b) serving requests.
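The two training stages can be sketched in plain Python. This is not PySpark and not Uber's code: the classes below only imitate the fit/transform pattern of Spark transformers so the idea runs anywhere. `HtmlStripper`, `StringIndexerSketch`, `ModelStage`, `PipelineSketch`, and `toy_predict` are hypothetical names, and `ModelStage` merely stands in for the role TFTransformer plays (wrapping a trained model as one more pipeline stage).

```python
import re


class HtmlStripper:
    """Non-fitted transformation: needs no state learned from the data."""

    _TAG = re.compile(r"<[^>]+>")

    def transform(self, rows):
        return [{**row, "text": self._TAG.sub(" ", row["text"]).strip()} for row in rows]


class StringIndexerSketch:
    """Fitted transformation: learns a label-to-index mapping that must be persisted."""

    def __init__(self, column):
        self.column = column
        self.index = {}

    def fit(self, rows):
        counts = {}
        for row in rows:
            counts[row[self.column]] = counts.get(row[self.column], 0) + 1
        # Most frequent label gets index 0; ties are broken alphabetically.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        self.index = {label: i for i, label in enumerate(ordered)}
        return self

    def transform(self, rows):
        # Labels never seen during fit map to -1 in this sketch.
        return [{**row, self.column + "_idx": self.index.get(row[self.column], -1)} for row in rows]


class ModelStage:
    """Stand-in for the model stage: wraps a trained predictor as a transformer."""

    def __init__(self, predict_fn):
        self.predict_fn = predict_fn

    def transform(self, rows):
        return [{**row, "prediction": self.predict_fn(row)} for row in rows]


class PipelineSketch:
    """Applies its stages in order, like a fitted pipeline used for serving."""

    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


training_rows = [
    {"text": "<p>I was charged twice</p>", "city": "sf"},
    {"text": "Lost my phone<br>", "city": "nyc"},
    {"text": "Wrong fare", "city": "sf"},
]

# Stage 1: fit and save the pre-processing transformers.
preprocessing = PipelineSketch([HtmlStripper(), StringIndexerSketch("city").fit(training_rows)])


# Stage 2 would train a model on preprocessing.transform(training_rows);
# a toy keyword rule takes the place of the trained model here.
def toy_predict(row):
    text = row["text"].lower()
    return "payment" if "charged" in text or "fare" in text else "other"


# Merge: the saved transformers plus the model stage form the serving pipeline.
serving_pipeline = PipelineSketch(preprocessing.stages + [ModelStage(toy_predict)])
result = serving_pipeline.transform([{"text": "<b>charged</b> too much", "city": "nyc"}])
```

The point of the merge step is that serving reuses exactly the transformers fitted during training, so a request is pre-processed the same way the training data was.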

Serving

Figure 3(b) depicts how we serve the trained model using a deep learning Spark Pipeline for both batch prediction and real-time prediction services. The Spark Pipeline built from training contains both pre-processing transformers and TensorFlow transformations. We extended Michelangelo to support serving generic Spark Pipelines, and utilized the existing deployment and serving infrastructure to serve the deep learning model. The pipeline used for serving runs on a Java Virtual Machine (JVM). In serving, we observe a p95 latency of less than 10 ms, which demonstrates the low-latency advantage of using an existing JVM serving infrastructure for deep learning models. By extending Spark Pipelines to encapsulate deep learning models, we were able to leverage the best of both CPU and GPU-driven worlds: 1) the distributed computation of Spark transformations and low-latency serving of Spark Pipelines using CPUs and 2) the acceleration of deep learning model training using GPUs.
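For reference, a p95 latency is the value that 95 percent of requests stay at or below. The sketch below computes it with the nearest-rank method over a made-up sample of request latencies. The numbers are illustrative only, and the article does not say which percentile method the production monitoring uses.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest sample value such that at least pct percent of samples are <= it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-request serving latencies in milliseconds.
latencies_ms = [
    3.1, 4.0, 2.7, 5.2, 3.8, 9.4, 4.4, 3.3, 6.1, 2.9,
    4.8, 3.6, 7.5, 3.0, 4.1, 5.9, 3.4, 4.6, 12.3, 3.9,
]

p95 = percentile_nearest_rank(latencies_ms, 95)
meets_target = p95 < 10.0  # compare against a 10 ms target
```

A tail percentile is more informative than the mean here because a single slow outlier, like the 12.3 ms request in this sample, barely moves the p95 but would distort an average.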

 

Model lifecycle management Pipeline: keeping models fresh

To prevent COTA v2 model performance from decaying over time, we built a model lifecycle management Pipeline (MLMP) on top of our DLSP. In particular, we leveraged Uber’s internal job scheduling tool Piper to build an end-to-end Pipeline to retrain and redeploy our models at a fixed frequency.

Figure 4. Our model lifecycle management pipeline consists of six jobs, including Data ETL, Spark Transformations, and Model Merging.

Figure 4, above, depicts the flow of this pipeline. Consisting of six jobs in total, it uses the existing APIs from Michelangelo to retrain the model. These jobs form a directed acyclic graph (DAG) with dependency indicated by the arrows:

  1. Data ETL: This involves writing a data extraction, basic transformation, and loading (ETL) job to prepare data. It typically pulls data from several different data sources, converts it into the right format, and puts it into a Hive database.
  2. Spark Transformation: This step transforms raw data (textual, categorical, numerical, etc.) into Tensor format so that it can be consumed by a TensorFlow graph for model training. The underlying transformation utilizes the Spark Engine via Michelangelo in a distributed computing fashion. The transformers are saved to a model store.
  3. Data Transfer: Computer clusters with CPUs perform the Spark transformations. Deep learning training requires GPUs to speed up the process. Therefore, we transfer output data from Step 2 to GPU clusters.
  4. Deep Learning Training: Once the data is transferred to the GPU clusters, a job is triggered to open a GPU session with a custom Docker container and start the deep learning training process. Once the training is done, a TensorFlow model file is saved to the model store.
  5. Model Merging: The Spark transformers from Step 2 and the TensorFlow model from Step 4 are merged to form the final model.
  6. Model Deployment: The final model is deployed, and a model_id is generated as a reference to the newly deployed model. External microservices can hit the endpoint using the serving framework of Thrift by referencing the model_id.
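The six jobs above can be sketched as a small dependency graph. This is only an illustration using Python's standard `graphlib` module, not Piper's API. The job names are made up, and the edges are inferred from the step descriptions (for example, Model Merging needs the outputs of both Step 2 and Step 4), not read from Figure 4.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it can start.
dependencies = {
    "data_etl": set(),
    "spark_transformation": {"data_etl"},
    "data_transfer": {"spark_transformation"},
    "deep_learning_training": {"data_transfer"},
    "model_merging": {"spark_transformation", "deep_learning_training"},
    "model_deployment": {"model_merging"},
}


def run_pipeline(dependencies, run_job):
    """Run every job once, in an order that respects all dependencies."""
    executed = []
    for job in TopologicalSorter(dependencies).static_order():
        run_job(job)
        executed.append(job)
    return executed


# A real scheduler would launch each job; here we only record the order.
log = []
order = run_pipeline(dependencies, log.append)
```

A scheduler that runs a graph like this at a fixed frequency gives the retrain-and-redeploy loop described above, and a failed upstream job naturally blocks everything that depends on it.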

 

Online test: COTA v1 vs. COTA v2

To validate the performance of the COTA v2 deep learning models we observed offline, we performed an online test before rolling out the system.

Test strategy

To rule out pre-existing sampling bias, we conducted an A/A test before we turned on the A/B test, as shown in Figure 5, below:

Figure 5. Overall test strategy to compare the COTA v1 and COTA v2 systems.

During both the A/A and A/B tests, support tickets are randomly assigned to control and treatment groups based on a 50/50 split. In the A/A test phase, both the control and treatment groups receive predictions from the same COTA v1 models. In the A/B test phase, the treatment group is instead served by the COTA v2 deep learning models.
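One common way to implement such a split is to hash a ticket identifier, which gives a roughly even assignment that is also stable: the same ticket always lands in the same group. The sketch below takes that approach as an assumption; the article only states that assignment is random. The salt, function names, and model labels are hypothetical.

```python
import hashlib


def assign_group(ticket_id, salt="cota-online-test"):
    """Deterministically assign a ticket to 'control' or 'treatment' (about 50/50)."""
    digest = hashlib.sha256(f"{salt}:{ticket_id}".encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"


def model_for(group, phase):
    """A/A phase: both groups get COTA v1. A/B phase: treatment gets COTA v2."""
    if phase == "A/B" and group == "treatment":
        return "cota_v2"
    return "cota_v1"


# Check that the split is close to even over many synthetic ticket IDs.
counts = {"control": 0, "treatment": 0}
for ticket_id in range(10_000):
    counts[assign_group(ticket_id)] += 1
```

Keeping the assignment unchanged between the A/A and A/B phases is what lets the A/A phase serve as a check for pre-existing differences between the two groups.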

Results

We ran the A/A test for one week and the A/B test for about one month. Figure 6, below, depicts two key metrics we keep track of: model accuracy (we use the Contact Type model as an example here) and average handle time per ticket. As shown in Figure 6(a), there is no difference in model performance during the A/A test phase, while there is a big jump after we turned on the A/B test. These results confirm that COTA v2’s deep learning system provides more accurate solutions to agents in comparison to COTA v1.

The average handle time per ticket is much lower for the treatment group throughout the A/B test phase, as shown in Figure 6(b). One additional observation from Figure 6(a) is that model performance decays over time, highlighting the need for the model lifecycle management Pipeline (MLMP) shown in Figure 4. (Note: to ensure consistency, we didn’t retrain the models during the course of the experiment.)

Our online testing again showed that, given enough training data, our COTA v2 deep learning models can significantly outperform the classical COTA v1 machine learning models.

Figure 6. Key metrics from online test: a) model accuracy and b) average handle time during both A/A and A/B tests on a daily basis.

A statistical analysis shows that during A/A testing there was no statistically significant difference between the average handle times of the control and treatment groups, while during A/B testing the difference was statistically significant: the treatment group’s average handle time was 6.6 percent lower in relative terms. COTA v2 therefore both speeds up ticket resolution and improves the accuracy of our ticket resolution recommendations. In addition, we measured customer satisfaction scores and found a slight improvement as a result of using COTA v2.
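The article does not say which statistical test was used. As one plausible sketch, the code below computes a relative reduction in mean handle time and runs Welch's two-sample test with a normal approximation for the p-value, which is reasonable for large samples. All handle times are simulated, not Uber's data: the treatment mean is simply set 6.6 percent below the control mean to mirror the reported effect.

```python
import math
import random
from statistics import NormalDist, mean, variance


def relative_reduction(control_mean, treatment_mean):
    """Reduction of the treatment mean expressed as a fraction of the control mean."""
    return (control_mean - treatment_mean) / control_mean


def welch_test(control, treatment):
    """Return (test statistic, two-sided p-value) using a large-sample normal approximation."""
    standard_error = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    statistic = (mean(control) - mean(treatment)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value


rng = random.Random(7)  # fixed seed so the simulation is repeatable

# Simulated handle times in minutes.
control = [rng.gauss(10.0, 2.0) for _ in range(2000)]
aa_treatment = [rng.gauss(10.0, 2.0) for _ in range(2000)]  # A/A: same model, same distribution
ab_treatment = [rng.gauss(9.34, 2.0) for _ in range(2000)]  # A/B: mean set 6.6 percent lower

aa_statistic, aa_p_value = welch_test(control, aa_treatment)
ab_statistic, ab_p_value = welch_test(control, ab_treatment)
observed_reduction = relative_reduction(mean(control), mean(ab_treatment))
```

In this setup, a large p-value in the A/A phase supports the claim that the two groups are comparable, so a small p-value in the A/B phase can be attributed to the model change and not to how the tickets were split.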

In addition to improving the customer support experience, COTA v2 will also save the company millions of dollars every year by streamlining the support ticket resolution process.

 

Next steps

Given the strong performance of our deep learning models in COTA v2, we plan to use issue type predictions in the future to determine which customer support agent to route a given ticket to, since those agents typically possess expertise in a specific set of issue types. These updates will increase our likelihood of identifying the right agent to resolve a ticket during the first routing, improving the efficiency of the whole ticket support system.

We are also looking into features that will allow us to respond more quickly to tickets that merely request information, for example, tickets posing questions such as “how do I update my Uber profile picture?” For such tickets, the solution is simply sharing static information (instructions in this case). Such tickets may make up only a few percent of all customer support tickets, but they could be automatically addressed by COTA v2 without the oversight of an agent. Streamlining these static responses will help customers save time and empower agents to focus on more challenging tickets, providing better customer care.

 

If you are interested in tackling machine learning challenges that drive business impact at scale, consider applying for a role on our Applied Machine Learning or Michelangelo teams, or on our San Francisco, Palo Alto, or Bangalore-based Customer Obsession Engineering teams. We are also hiring San Francisco-based product managers, as well as data analysts and scientists for our Customer Obsession team.

 

COTA is a cross-functional collaboration between Uber’s Applied Machine Learning, Customer Support Platform, and Michelangelo teams and Uber AI Labs, with contributions from Piero Molino, Viresh Gehlawat, Yi-Chia Wang, Joseph Wang, Eric Chen, Paul Mikesell, Alex Sergeev, Mike Del Balso, Chintan Shah, Christina Grimsley, Taj Singh, Jai Malkani, Fran Bell, and Jai Ranganathan. We greatly appreciate Molly Vorwerck and Wayne Cunningham for their help editing this article.

Huaixiu Zheng
Huaixiu Zheng is a senior data scientist on Uber's Machine Learning team.
Guoqin Zheng
Guoqin Zheng is a senior machine learning engineer on Uber's Michelangelo team.
Basab Maulik
Basab Maulik is a senior software engineer on Uber's Customer Obsession team.
Hugh Williams
Hugh Williams is a data science manager on Uber's Applied Machine Learning team.
Jeremy Hermann
Jeremy Hermann is an engineering manager on Uber's Michelangelo team.