By Nikhil Joshi & Isabel Geracioti
Millions of Uber trips take place each day across nearly 80 countries, generating information on traffic, preferred routes, estimated times of arrival/delivery, drop-off locations, and more that enables us to facilitate better experiences for users.
To make our data exploration and analysis more streamlined and efficient, we built Uber’s data science workbench (DSW), an all-in-one toolbox for interactive analytics and machine learning that leverages aggregate data. DSW centralizes everything a data scientist needs to perform data exploration, data preparation, ad-hoc analyses, model exploration, workflow scheduling, dashboarding, and collaboration in a single-pane, web-based graphical user interface (GUI).
Leveraged by data science, engineering, and operations teams across the company, DSW has quickly scaled to become Uber’s go-to data analytics solution. Current DSW use cases include pricing, safety, fraud detection, and navigation, among other foundational elements of the trip experience. In this article, we discuss two main themes: 1) the challenges we had building and, in particular, encouraging mass adoption of DSW, and 2) how the workbench made data science at Uber more streamlined and scalable than ever before.
Finding a scalable, flexible solution
Before deciding to build our data science workbench, we evaluated multiple third-party solutions and determined that they could not easily scale to the number of users or the volume of data we anticipated on the platform, nor would they integrate well with Uber’s internal data tools and platforms. We also realized that building our own platform would enable us to target specific use cases, such as geospatial analytics, custom visualization, integration with Michelangelo (our machine learning framework), and deep learning. We concluded that our best option was to build an in-house solution. Before we began development, however, we needed to better understand user needs.
Standardizing a complex discipline
We began building DSW with the knowledge that our final design needed to support thousands of users across several offices worldwide. On top of that, we needed to nail down the use cases and preferences of a user base with widely varied skill sets.
Nail it, then scale it
Given the diversity of our data science teams, a one-size-fits-all approach would have failed. At the same time, we did not have the engineering time or resources to build out an endless suite of features. To succeed, we needed to strike a balance between these two extremes.
To nail the product-market fit, we spent weeks conducting user research through surveys, interviews, and focus groups before writing a line of code. Our research led us to a lengthy list of requirements for our new solution:
- Support for different productivity tools (such as RStudio, Jupyter Notebook, Zeppelin Notebook, and Shiny Dashboards), coding languages (including R, Python, and Scala), and distributed computing systems (for instance, Hive, Vertica, Spark, and Elasticsearch).
- A fully hosted environment with no manual setup.
- Dedicated compute and storage, ensuring that multiple users could “share” the platform without stealing resources from each other.
- Pre-configured RStudio Server and Jupyter Notebooks with all the necessary internal and external libraries, plus additional customization options.
- Integration with Uber’s infrastructure, from workflow management and visualization tools to data sources and analytics engines.
- A UI that facilitates easy collaboration and knowledge sharing between colleagues.
With DSW, we were able to build a product that satisfied all of these requirements—and more. Since the development process was largely an integration effort, let us dive straight into what the DSW stack looks like today.
The DSW stack
DSW is built on top of a highly scalable, containerized infrastructure capable of supporting thousands of users. On the back end, DSW is powered by our Hadoop and warehousing stack, consisting of thousands of nodes for large-scale distributed processing. DSW’s management service is written in Go and the web-based front-end is written in React and Redux.
To use DSW, users first create sessions, which are independent compute units running as Docker containers, to perform data analyses. At Uber, containers are provisioned via an internal resource management and scheduling framework called Micro Deploy. Each session is pre-baked with the internal and OSS libraries needed to get started, including pandas, scikit-learn, SciPy, NumPy, Matplotlib, Seaborn, Folium, Bokeh, and more. On top of these pre-baked packages, users can customize their sessions by installing additional packages themselves.
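As a small illustration of working with a pre-baked session, the sketch below checks which libraries are importable before a user decides to install anything by hand. It is not part of DSW: the helper `missing_modules` and the `EXPECTED_MODULES` list are hypothetical, and the import names are our assumption about how the libraries named above are imported in Python.

```python
import importlib.util

# Hypothetical list: import names we assume for the libraries named in the article.
EXPECTED_MODULES = [
    "pandas", "sklearn", "scipy", "numpy",
    "matplotlib", "seaborn", "folium", "bokeh",
]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found on the import path."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    if missing:
        print("Not available in this environment:", ", ".join(missing))
    else:
        print("All expected libraries are importable.")
```

Anything reported as missing would be a candidate for a hand-installed package in that session.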
Within a session, users can build Shiny and Python (Bokeh) dashboards directly from DSW. Each user gets a dedicated dashboard server where they can publish dashboards. File storage and sharing are backed by CephFS, mounted through FUSE. Each user’s Docker container has a mounted Ceph directory that persists all generated content. Files are stored in a highly available file system, allowing them to outlive the session that created them. DSW employs distributed locking through a ZooKeeper client to avoid race conditions when creating sessions and refreshing files.
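To make the race condition concrete, here is a minimal sketch of the check-then-create pattern that a lock protects. It is an in-process stand-in, not DSW's implementation: `NamedLocks` uses Python threads in place of a ZooKeeper lock, and `get_or_create_session`, the `session/<user>` key format, and the container strings are all hypothetical.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class NamedLocks:
    """In-process stand-in for a distributed lock service: one lock per key."""

    def __init__(self):
        self._guard = threading.Lock()          # protects the lock table itself
        self._locks = defaultdict(threading.Lock)

    @contextmanager
    def hold(self, key):
        with self._guard:
            lock = self._locks[key]
        with lock:
            yield


sessions = {}        # user -> session handle (hypothetical)
create_calls = []    # records every actual creation, for inspection
locks = NamedLocks()


def get_or_create_session(user):
    """Create a session for the user only if one does not exist yet."""
    with locks.hold(f"session/{user}"):
        if user not in sessions:
            create_calls.append(user)
            sessions[user] = f"container-for-{user}"
        return sessions[user]


def run_concurrent_requests(user, n=8):
    """Simulate n simultaneous 'create session' requests for one user."""
    threads = [
        threading.Thread(target=get_or_create_session, args=(user,))
        for _ in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because the existence check and the creation happen under the same per-user lock, concurrent requests for one user result in a single creation. In a multi-host deployment the same pattern would need a lock held in a shared service such as ZooKeeper rather than in process memory; the article does not say which client library or lock paths DSW uses.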
DSW also natively supports job scheduling. Via a lightweight graphical interface, data consumers can automate many daily activities (such as report generation, data quality checks, model retraining, and dashboard publishing) and put those tasks on a schedule. DSW automatically runs these jobs at the defined cadence and notifies users of failures.
Each session in DSW pulls assigned jobs periodically from the jobs data and runs them. If a session fails or crashes for any reason, jobs are automatically picked up when the session restarts.
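The pull model described above can be sketched as follows. This is an illustration under stated assumptions, not DSW's code: `Job`, `JobStore`, and `Session` are hypothetical names, the store is an in-memory object standing in for the persistent jobs data, and failures are collected in a list where a real system would notify the user.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Job:
    name: str
    interval_s: int                  # cadence between runs, in seconds
    action: Callable[[], None]
    next_run: float = 0.0            # timestamp at which the job is next due


@dataclass
class JobStore:
    """Stand-in for the persistent jobs data; it outlives any one session."""
    jobs: Dict[str, Job] = field(default_factory=dict)
    failures: List[str] = field(default_factory=list)

    def due(self, now: float) -> List[Job]:
        return [job for job in self.jobs.values() if job.next_run <= now]


class Session:
    """A session that pulls its due jobs from the store and runs them."""

    def __init__(self, store: JobStore):
        self.store = store

    def poll(self, now: float) -> List[str]:
        ran = []
        for job in self.store.due(now):
            try:
                job.action()
            except Exception as exc:
                # Stand-in for notifying the user of a failed run.
                self.store.failures.append(f"{job.name}: {exc}")
            # The schedule advances only after the job was attempted, so a
            # session that crashes before this point leaves the job due.
            job.next_run = now + job.interval_s
            ran.append(job.name)
        return ran
```

Because the schedule lives in the store rather than in the session, a restarted session is simply a new `Session` over the same `JobStore`: any job whose `next_run` has passed is picked up on its first poll.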
Users can also access machine learning frameworks like scikit-learn, TensorFlow, Theano, and Keras from their sessions, allowing them to quickly experiment, train, and test various supervised and unsupervised algorithms, as well as our Michelangelo machine learning-as-a-service system. The ability to access various frameworks from a single tool has helped democratize machine learning at Uber.
DSW adoption and impact
In just three months, the beta version of Data Science Workbench was ready for launch. But building DSW was only half of the equation. The next step was convincing Uber’s data science teams to use and trust what might have seemed like just another tool. Collaborative features like file and dashboard sharing are useless if only a fraction of data scientists are actively using the workbench.
We applied two tactics to ensure that the company aligned behind DSW as the platform for Data Science at Uber. First, we embarked on an internal marketing campaign, including conventional strategies like visual branding (i.e., a logo and graphics), promotional emails, and newsletters. Second, we spent time reaching out to teams and individuals to build relationships and keep feedback channels open by presenting at team and department meetings, giving tech talks on the new project, and simply striking up conversations with our users.
Uber’s data science workflow is no longer splintered and resource-strained, meaning that users of various engineering and data science backgrounds can onboard quickly. In fact, DSW has reduced development environment setup time from over an hour to just a few minutes.
Today, data scientists use DSW for text mining and processing, machine learning, survival modeling, consumer profile modeling, data preparation for visualizations, and more. Below, we outline a few examples of how this powerful analytics toolbox has been used across the company:
Robby, a data scientist on the UberEATS team, uses DSW to optimize incentives for UberEATS delivery partners. He built a machine learning model using Python in DSW to estimate the supply curve, forecast demand, and measure overall marketplace efficiency. The output training data, model parameters, and optimal incentives are published to a Shiny dashboard, which is also hosted in DSW. This dashboard is shared with City Operations Analysts across the world, empowering them to evaluate marketplace efficiency in their own cities.
Ana, an Analytics Manager for LATAM based out of Brazil, works on driver satisfaction and well-being. To enable cities to better understand their driver partners, she built Dràuzio, a sophisticated but easy-to-consume sentiment analyzer, on DSW. Hundreds of thousands of Uber driver partners share their joys, concerns, and feedback in CSAT surveys. Dràuzio parses, cleans, mines, and models text fields to surface the perceived sentiment in every city. Analysts in Brazil can now view sentiments in their own cities through a browser-based dashboard published using Shiny. The dashboard uses a heat-map visualization to show the strength of each sentiment.
With this information at their fingertips, operations teams across the world can gain a deeper understanding of the riders and driver partners in their cities and make more informed decisions about how to improve the Uber experience.
Our long-term vision is for data science workbench to serve as a one-stop shop for Uber’s data scientists, as well as contribute to our goal of democratizing machine learning. More specifically, we plan to build in additional support for deep learning by integrating DSW with Uber’s machine learning-as-a-service platform, Michelangelo.
If creating solutions to make data science more accessible and efficient appeals to you, apply for a role on Uber’s Data Platform team!
Nikhil Joshi leads product management for Uber’s Data Platforms and Infrastructure team. Isabel Geracioti is a technical writer at Uber with a focus on data platforms and infrastructure.