What happens when you have to migrate hundreds of millions of rows of data and more than 100 services over several weeks while simultaneously keeping Uber running for millions of riders? This is the story of how dozens of engineers helped Uber move to Mezzanine in 2014.


In the beginning of 2014, we faced the hard fact that given our trip growth (~20% a month), the solution for storing trips was going to run out of steam both in terms of storage volume and IOPS by the end of the year, if not sooner. We launched project Mezzanine as a massive attack to conquer that particular problem. The target launch date was Halloween (October 31st) which is traditionally a very high-volume traffic day.


Background

Like most web-based services, the Uber backend system started out as a “monolithic” software architecture with a bunch of app servers and a single database. The system was mainly written in Python and used SQLAlchemy as the ORM-layer to the database. The original architecture was fine for running a relatively modest number of trips in a few cities. By early 2014, the architecture had evolved into a true service-oriented architecture of close to 100 services. The high-level picture of the system was this:

[Image: high-level service-oriented architecture diagram]

The real-time services are responsible for executing active trips and the backend services handle rider billing, driver payouts, fraud detection, analytics, city management, etc. The big problem in the diagram above is that we still relied on a single PostgreSQL instance to store most of the data. The pie chart below shows how data was distributed in the database:

[Image: pie chart of data distribution in the database]

Trip data was taking up the largest percentage, was (and still is) the fastest growing data, and also contributed to the most IOPS. We use trip data in order to improve services like uberPOOL, provide rider and driver support, prevent fraud, and develop and test new features like Suggested Pick-ups in San Francisco. Thus, we embarked on project Mezzanine to refactor the system so it would look like this:

[Image: target architecture after the Mezzanine overhaul]


The New Trip Store

The first design decision was the choice of database for the trip store. Our short list of requirements was:

  • Operationally robust (no data loss, supports backup, replication to secondary data centers, easy to troubleshoot, predictable, operational expertise).
  • Horizontally scalable both in storage capacity and IOPS.
  • High write-availability. We always want to be able to persist a trip to stable storage. It’s okay to trade off short-term read-availability, as the backend works mostly in a batch-oriented fashion.
  • Secondary index support. Trips are looked up by user, city, and summarized in various ways.
  • No downtime for any operation (expanding storage, backup, adding indexes, adding data, and so forth).

The last item in the list was addressing a very immediate pain point. The trips table in PostgreSQL had grown so big that any operation that needed to add a new column or add a new index caused downtime. This made it increasingly cumbersome to develop new features.

We decided that a column-oriented, schemaless approach where data (JSON blobs) are organized in a grid indexed by trip-UUID, column name, and optionally a timestamp would work well as an overall data model. The model lends itself naturally to horizontal scaling by partitioning the rows across multiple shards, and supports our rapid development culture by being schemaless. New columns can be added, and new fields can be added to a column with no reconfiguration.
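The grid model described above can be sketched in a few lines. This is a minimal illustration, not the actual Schemaless API: the cell layout, field names, and the use of a microsecond timestamp as the optional third key component are all assumptions.

```python
import json
import time
import uuid

# A cell is an immutable JSON blob addressed by (row key, column name, ref key).
# Here the row key is a trip UUID, and ref_key plays the role of the optional
# timestamp described above. All names are illustrative.
def make_cell(trip_uuid, column_name, body, ref_key=None):
    return {
        "row_key": str(trip_uuid),
        "column_name": column_name,
        "ref_key": ref_key if ref_key is not None else int(time.time() * 1e6),
        "body": json.dumps(body),
    }

trip_id = uuid.uuid4()
cell = make_cell(trip_id, "BASE", {"city": "SF", "fare": 12.50})
```

Because new fields simply go into the JSON body and new column names are just new keys, no schema reconfiguration is needed to evolve the model.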

We evaluated various NoSQL-style databases with the above characteristics. However, we didn’t feel confident they were a good fit for storing our trip data, because of either our operational experience or the product’s maturity.

Inspired by blog posts, such as this one from FriendFeed, we decided to build our own simple, sharded datastore on top of MySQL. The key characteristics of the system we built are:

  • Sharding: Rows are sharded into a fixed set of shards, decided at setup time. Typically, we use 4096. Each shard corresponds to a MySQL tablespace, and the shards are distributed across a number of MySQL servers. Shards can be moved between MySQL servers for load-balancing, and the capacity can be increased online. We typically expand by splitting each MySQL server in two.
  • Append-only (no updates) data model: A cell can never be modified after it is written. This is very useful for a system that stores transactional data and wants to guard against data corruption. Because writes are append-only, modifications are naturally idempotent and commutative. The latter means that we can replay updates in any order and get the same result. (We learned later that the append-only style is also advocated by the lambda architecture.)
  • Buffered writes. If the shard where a cell needs to be written to is unavailable (or slow), we write the data to a pending table in any other available MySQL server. These are then later replayed once the shard becomes available. Due to the idempotent and commutative data model, this is always safe and does not require cross-host coordination.
  • Sharded secondary indexes: Indexes can be created on multiple fields in the columns and are sharded on a specific key (e.g., user uuid). They are implemented as MySQL tables and backfilled in the background. In case we need to change the index (e.g., adding a field), we can create a new version, backfill it, and then switch to the new version by changing an index alias, all without application downtime.
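The sharding scheme from the first bullet can be sketched as follows. The hash-free modulo mapping and the host layout here are assumptions for illustration, not the actual implementation:

```python
import uuid

NUM_SHARDS = 4096  # fixed set of shards, decided at setup time

def shard_for(trip_uuid):
    # Map a trip UUID to one of the fixed shards. Every cell for a trip lands
    # in the same shard, so shards can be moved between MySQL servers as
    # whole units for load balancing.
    return int(trip_uuid) % NUM_SHARDS

def host_for(shard, hosts):
    # Shards are spread across MySQL servers. Splitting each server in two
    # doubles capacity without remapping any row to a different shard.
    return hosts[shard % len(hosts)]

t = uuid.uuid4()
assert 0 <= shard_for(t) < NUM_SHARDS
```

The key property is that the shard count never changes; only the shard-to-server assignment does, which is what makes online capacity expansion possible.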

The whole system, which we simply call Schemaless in homage to its design, is written in Python. The initial version took about 5 months from idea to production deployment, and we will describe specific implementation details in future blog posts.


From SQLAlchemy to Schemaless

Writing a new scalable data store from scratch is a major undertaking. Refactoring crucial parts of a live system from using a PostgreSQL database to leveraging a column-oriented database is a whole different ballgame. Obviously, trip data is an integral part of much of Uber’s backend systems code, so this undertaking would touch most engineering teams.

The major milestones in this part of the project were:

  • Changing trip-ids to trip-UUIDs everywhere.
  • Column layout for trips in Schemaless (e.g., new trips data model).
  • Backfilling data from PostgreSQL to Schemaless.
  • Mirrored writes to PostgreSQL and Schemaless.
  • Rewriting all queries to work with Schemaless.
  • Validation, validation, validation, and validation!

The first task, before we could really get started on the migration, was a trip-id to trip-UUID migration, as the original code depended on an auto-incremented PostgreSQL identifier. Several hundred SQL queries needed to be rewritten. These queries were all SQLAlchemy Python code, consisting of either explicit queries or indirect queries through model relationships. They all needed to be rewritten to work on top of the new Schemaless Query API, a much more restricted API that does not support joins against other tables in PostgreSQL.
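To give a feel for the kind of rewrite involved, here is a hypothetical before/after sketch. The SQLAlchemy query, the index name, and the store/index APIs are all illustrative stand-ins, not the actual Schemaless Query API:

```python
# Before (illustrative SQLAlchemy, relying on the relational model):
#
#   session.query(Trip).filter(Trip.rider_uuid == rider_uuid).all()
#
# After: no joins are available, so the lookup becomes two explicit steps --
# query a sharded secondary index for trip UUIDs, then fetch cells by UUID.
def trips_for_rider(store, index, rider_uuid):
    trip_uuids = index.lookup("trips_by_rider", rider_uuid)
    return [store.get_cell(u, "BASE") for u in trip_uuids]

# Minimal in-memory stand-ins, just to make the sketch runnable:
class FakeIndex:
    def __init__(self, rows):
        self.rows = rows
    def lookup(self, name, key):
        return self.rows.get((name, key), [])

class FakeStore:
    def __init__(self, cells):
        self.cells = cells
    def get_cell(self, row_key, column_name):
        return self.cells[(row_key, column_name)]
```

The restricted API makes every data access pattern explicit, which is exactly why the rewrite touched several hundred queries.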

Our initial goal was to remove all uses of the SQLAlchemy trip model and queries directly on the Trips table. Essentially, we wanted to get to the following architecture:

[Image: goal architecture with the lib/tripstore abstraction]

The lib/tripstore exposed an API compatible with a Schemaless-based implementation, and it contained a switch so that a query could go through either PostgreSQL or Schemaless. Thus, we emulated the Schemaless API on top of the PostgreSQL data model as the refactoring of the code progressed.

As all writes were mirrored to Schemaless, we replayed all queries against Schemaless and verified the results in the background. Thus, we could almost immediately start checking that data in Schemaless was consistent with data in PostgreSQL. Since validation adds load to the database (which was already heavily loaded), we used a probabilistic approach to control the additional load we put on PostgreSQL. A big takeaway here is that the data modelling, backfill, refactoring, and Schemaless development could all progress in parallel and be continuously deployed in small increments.


Execution

[Image: the Mezzanine migration war room]

Our conference room at Global HQ in San Francisco during the Mezzanine migration.

The final crunch of the Mezzanine project took place over a period of 6 weeks. We all got together in a “war room” (except for a few remote folks) and went to work on converting the remaining SQLAlchemy code over to the new tripstore library, completing the backfill, designing and rebuilding indexes, and really dialing up the validation to what some might call “obsessive behavior”.

On the final day, a month before Halloween 2014, we were ready to flip the switch from using the Trips table in PostgreSQL to Schemaless. Team members from all engineering groups were in the SF office at 6 am that day, so if something went wrong, all hands could be on deck immediately and we would have a whole working day to fix it. The mood was cautiously optimistic but tense. Much to everyone’s astonishment, it was a non-event. No alerts, no panic. For the Uber platform, it was business as usual. It was just like Halloween: it might look scary, but it’s just good fun!


Lessons Learned

  • Use UUIDs: Always use UUIDs for everything. If you start out with integer IDs and you experience large growth, it’s a lot of tedious work to undo it.
  • Keep the data layer simple: It must be easy to debug and troubleshoot. Performance metrics are especially valuable. Relying on the MySQL engine as the low-level storage layer has allowed us to very quickly build a robust system.
  • There’s a learning curve: The (mental) transition from writing code against an ORM or SQL directly to writing against a key-value store with limited query capabilities takes time to master. Idempotency replaces transactions.
  • NoSQL is powerful: The column-based approach with sharded indexes puts the performance tradeoffs of the data layer directly in the programmer’s hands. With the right abstractions, it becomes straightforward to write solutions that not only scale out in the app-server layer but also in the data layer.
  • Finish fast: Do the final migration swiftly. As features are being developed, the migration is always chasing a moving target, so you need to move faster than the rest of the code base.
  • Trial and error: Don’t expect to get the data model right the first time. Plan for multiple complete and partial backfills.
  • Uber On! A positive and can-do attitude makes all the difference when pulling off big team efforts like these.
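The UUID lesson in practice amounts to generating identity in the application rather than in the database. A minimal sketch, with illustrative names:

```python
import uuid

def new_trip(rider_uuid):
    # The trip's identity is assigned at creation time by the application,
    # not by an auto-increment column, so it stays valid across shards,
    # datastores, and migrations.
    return {"trip_uuid": str(uuid.uuid4()), "rider_uuid": rider_uuid}
```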

Since the switch, we have already doubled the capacity of our trip store once, with zero downtime, and implemented numerous performance and operational improvements. On top of this, Schemaless is now used by a handful of other services.

This post is dedicated to the entire Mezzanine Team, who pulled off this amazing effort!


By René W. Schmidt, scalability engineer at Uber and technical lead on the Mezzanine Project.

Update: For more info, see René’s talk at Facebook’s second annual @Scale conference in September 2015.

Photo Credits for Header: “Masai Mara” by Christopher Michel licensed under CC-BY 2.0. Image cropped for header dimensions.

Header Explanation: The annual migration of the Serengeti (in Kenya and Tanzania) is widely regarded as one of the greatest in the natural world.
