Migrating Functionality Between Large-scale Production Systems Seamlessly

A common axiom among Uber engineers states that building new features is like fixing a car’s engine while driving it. As we scaled up to our present level of support for 14 million trips per day, the car in that axiom got upgraded to a jet airliner. The truth is, with global operations in over 700 cities and people depending on us for transportation and earnings, there is no downtime. New features and system upgrades must go into a production system.

In 2018, Uber’s payments engineers migrated the functionality of dozens of complex payment instruments while keeping existing payment functionality running for hundreds of millions of users. It was a difficult but rewarding technical challenge, and it serves as a solid example of how we operate in a production environment.

The processes we used for this migration, and the lessons learned, can apply to engineers working on any large-scale systems that can’t afford downtime.

Background

When Uber introduced upfront pricing, we started separating the concepts of an authorization and a charge. Since riders may change their destination or cancel the ride request, it doesn’t make sense to charge at the start of the trip. Instead, we authorize the rider’s credit card for the fare amount and hold the funds until the trip is completed. Once the trip is finished, the credit card charge is processed.

Instituting this payment feature required changes to eight microservices in our backend. Over the ensuing years, we added new payment instruments to make using the Uber platform more convenient for users, and launched newer lines of business such as Uber Eats, Uber Freight, and Jump electric bikes and scooters. We quickly realized that updating all of these services to handle each new use case had become an unsustainable use of development time.

To this end, we initiated a project in 2018 that would allow authorization hold logic to be written once and used across our existing and future products. One hard requirement for this project was zero downtime as we migrated to the new system.

Pre-migration

As Uber’s tech stack evolved from its original monolithic codebase to our microservice-based architecture, we became well-versed in conducting large migrations. These previous experiences taught us that, for a successful migration, we need to have the following:

  • Shadowing: Forward current production traffic to the new system and observe its behavior
  • Validation: Validate the correctness of the new system

Shadowing

Shadowing production traffic to our new Auth service gave us confidence that production traffic would be handled correctly, without regressions, during and after the migration (how we verified this is covered in the Validation section). It also gave us a rough estimate of the requests per second (RPS), traffic patterns, and other performance statistics the new service would face.

All original production requests were forwarded to the Auth Service with a programmed delay using dedicated Celery queues. The advantage of using dedicated Celery queues is that they do not impact existing jobs in the production flow. Additionally, since our legacy service was written in Python, Celery’s native language, it was an easy choice.
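
The forwarding pattern described above can be sketched without any infrastructure. The example below is a minimal, in-memory stand-in for a dedicated delayed queue, not the production setup: the names ShadowQueue and handle_production_request, the request shape, and the explicit "now" timestamps are all assumptions made for illustration. With Celery itself, the delay and the dedicated queue would more likely be expressed through the countdown and queue options of apply_async.

```python
import heapq
import itertools
from typing import Any, Callable


class ShadowQueue:
    """In-memory stand-in (hypothetical) for a dedicated, delayed shadow queue."""

    def __init__(self, handler: Callable[[dict], Any], delay_seconds: float) -> None:
        self._handler = handler          # the new service's entry point
        self._delay = delay_seconds      # programmed delay before delivery
        self._heap = []                  # entries are (due_time, seq, request)
        self._seq = itertools.count()    # tie-breaker so dicts are never compared
        self.errors = []                 # shadow failures are recorded, not raised

    def enqueue(self, request: dict, now: float) -> None:
        # Copy the request so later changes on the production path are not observed.
        heapq.heappush(self._heap, (now + self._delay, next(self._seq), dict(request)))

    def drain(self, now: float) -> int:
        """Deliver every request whose delay has elapsed; return how many were delivered."""
        delivered = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, request = heapq.heappop(self._heap)
            try:
                self._handler(request)
            except Exception as exc:
                # A failure in the shadow path must never affect production.
                self.errors.append(exc)
            delivered += 1
        return delivered


def handle_production_request(request: dict, shadow: ShadowQueue, now: float) -> str:
    """Serve the request on the legacy path, then hand a copy to the shadow queue."""
    result = f"legacy-ok:{request['id']}"  # placeholder for the real legacy logic
    shadow.enqueue(request, now)
    return result
```

The property worth noticing is that the production response is computed before anything is shadowed, and an exception raised by the shadow handler is captured rather than propagated.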

Validation

Before beginning the migration, we needed to validate that the new Auth service would handle production traffic correctly without any regressions. Validation depends on a lot of factors, and the goals we wanted to achieve were:

  1. The request arguments we pass to payment service providers (PSPs) from the Auth service have to be the same as those passed by the legacy service.
  2. The Auth service reaches the same state as the legacy service based on the response from the PSP.

There were two things we could validate: that we pass the same arguments, and that we implement the functions in the same way. The latter is especially hard when dealing with a different programming language, a different storage model, and other factors. We asked ourselves: do we need the entire function implementation to be the same, or only the functionality? We decided that only the functionality needed to match the legacy service.

Another goal was to provide a new generic interface to the rest of Uber for Auth, Void, and Capture calls to PSPs. Additionally, we needed to manage the state and life cycle of our Auth service based on the response. To do this, we ensured that we passed the same arguments to the PSP’s endpoint while doing the Auth, Void, and Capture calls. Based on the response, the AuthHold state needed to be updated in our new database.
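
To make the idea of a generic interface plus a managed life cycle more concrete, here is a minimal sketch. Everything in it is hypothetical: the state names, the allowed transitions, the method signatures, and the ScriptedPsp test double are assumptions for illustration and are not taken from the actual Auth service.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum


class AuthHoldState(Enum):
    # Hypothetical states; the real service may track different ones.
    PENDING = "pending"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    VOIDED = "voided"
    FAILED = "failed"


class PaymentProvider(ABC):
    """Generic PSP interface: each call returns True when the PSP accepts it."""

    @abstractmethod
    def auth(self, hold_id: str, amount_cents: int) -> bool: ...

    @abstractmethod
    def capture(self, hold_id: str, amount_cents: int) -> bool: ...

    @abstractmethod
    def void(self, hold_id: str) -> bool: ...


@dataclass
class AuthHold:
    hold_id: str
    amount_cents: int
    state: AuthHoldState = AuthHoldState.PENDING

    def _require(self, expected: AuthHoldState) -> None:
        # Reject the call before contacting the PSP if the life cycle forbids it.
        if self.state is not expected:
            raise ValueError(f"expected {expected.name}, but hold is {self.state.name}")

    def authorize(self, psp: PaymentProvider) -> None:
        self._require(AuthHoldState.PENDING)
        ok = psp.auth(self.hold_id, self.amount_cents)
        self.state = AuthHoldState.AUTHORIZED if ok else AuthHoldState.FAILED

    def capture(self, psp: PaymentProvider, final_amount_cents: int) -> None:
        self._require(AuthHoldState.AUTHORIZED)
        if psp.capture(self.hold_id, final_amount_cents):
            self.state = AuthHoldState.CAPTURED  # on failure the hold stays AUTHORIZED

    def void(self, psp: PaymentProvider) -> None:
        self._require(AuthHoldState.AUTHORIZED)
        if psp.void(self.hold_id):
            self.state = AuthHoldState.VOIDED  # on failure the hold stays AUTHORIZED


class ScriptedPsp(PaymentProvider):
    """Test double that returns preset answers and records the calls it receives."""

    def __init__(self, auth_ok: bool = True, capture_ok: bool = True, void_ok: bool = True) -> None:
        self.auth_ok, self.capture_ok, self.void_ok = auth_ok, capture_ok, void_ok
        self.calls = []

    def auth(self, hold_id: str, amount_cents: int) -> bool:
        self.calls.append(("auth", hold_id, amount_cents))
        return self.auth_ok

    def capture(self, hold_id: str, amount_cents: int) -> bool:
        self.calls.append(("capture", hold_id, amount_cents))
        return self.capture_ok

    def void(self, hold_id: str) -> bool:
        self.calls.append(("void", hold_id))
        return self.void_ok
```

Keeping the state transitions in one place is what lets the hold logic be written once: callers only see authorize, capture, and void, while the PSP-specific details sit behind the provider interface.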

Since a PSP is external, we don’t want to make the same request in our legacy and new systems. It would double the traffic, the behavior of the PSP might be nondeterministic, and it would not be clear which was the primary system for the request.

The principle behind our validation is that, given a function f, the same inputs should produce the same output. When the legacy system made a request, we cached both the request and the response from the PSP with the intent to replay the traffic to the new Auth service later. For our cache, we used Redis, an in-memory data structure store, and keyed the values on a hash of the namespace and the UUID, as shown below. We chose Redis due to its fast reads and the fact that, in this case, storage doesn’t need to be highly durable.

Hash(‘namespace’+UUID), where a namespace is a request or response
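
As an illustration of this keying scheme, the sketch below records a PSP exchange under two keys, one per namespace, and reads it back. It rests on several assumptions: the hash function is not specified in the text, so SHA-256 is used here; a small dict-backed class with expiry stands in for Redis; and the function names, JSON serialization, and one-hour default expiry are hypothetical. With a real Redis instance, the equivalent write would be a SET with an expiry.

```python
import hashlib
import json
import time

NAMESPACES = ("request", "response")


def cache_key(namespace: str, request_uuid: str) -> str:
    """Hash('namespace' + UUID); SHA-256 is an assumption, the text does not name the hash."""
    if namespace not in NAMESPACES:
        raise ValueError(f"unknown namespace: {namespace!r}")
    return hashlib.sha256(f"{namespace}{request_uuid}".encode("utf-8")).hexdigest()


class InMemoryCache:
    """Dict-backed stand-in for Redis with per-key expiry (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries behave as if they were never stored
            return None
        return value


def record_psp_exchange(cache, request_uuid, request_args, response, ttl_seconds=3600):
    """Called on the legacy path: store what was sent to the PSP and what came back."""
    cache.set(cache_key("request", request_uuid), json.dumps(request_args, sort_keys=True), ttl_seconds)
    cache.set(cache_key("response", request_uuid), json.dumps(response, sort_keys=True), ttl_seconds)


def load_psp_exchange(cache, request_uuid):
    """Called on the shadow path: return (request_args, response), or None if either expired."""
    raw_request = cache.get(cache_key("request", request_uuid))
    raw_response = cache.get(cache_key("response", request_uuid))
    if raw_request is None or raw_response is None:
        return None
    return json.loads(raw_request), json.loads(raw_response)
```

Because the cached data is only needed until the delayed shadow request has been replayed, a short expiry is enough, which fits the observation that this storage does not need to be highly durable.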

Leveraging the shadowing work we did, we added a slight delay to the requests shadowed to the Auth service. Based on each shadowed request, the Auth service constructs its own request arguments for the PSP’s endpoint, and our Payment-Integration component validates them against the original request arguments. Because we can’t call a PSP’s endpoint twice, the response for that request is replayed from Redis, and the AuthHold state is updated in our Auth service database, as shown in Figure 1, below:

Figure 1: Shadowing traffic through new components in our payment system lets us validate the outputs without affecting the operation of our production system.

We also needed to validate that both the old and new systems reached the same AuthHold state based on the response.
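
The two checks, matching arguments and matching end state, can be sketched as a single validation step. This is an illustrative outline only: the report structure, the function names, the callables passed in, and the string-valued states are assumptions, and the real Payment-Integration component may compare requests quite differently.

```python
from dataclasses import dataclass
from typing import Callable

MISSING = "<missing>"  # marker for a key present on one side only


def diff_arguments(legacy_args: dict, new_args: dict) -> dict:
    """Return {key: (legacy_value, new_value)} for every argument that differs."""
    mismatches = {}
    for key in sorted(set(legacy_args) | set(new_args)):
        legacy_value = legacy_args.get(key, MISSING)
        new_value = new_args.get(key, MISSING)
        if legacy_value != new_value:
            mismatches[key] = (legacy_value, new_value)
    return mismatches


@dataclass
class ValidationReport:
    request_uuid: str
    argument_mismatches: dict
    legacy_state: str
    new_state: str

    @property
    def states_match(self) -> bool:
        return self.legacy_state == self.new_state

    @property
    def passed(self) -> bool:
        return not self.argument_mismatches and self.states_match


def validate_shadow_request(
    request_uuid: str,
    legacy_args: dict,
    cached_response: dict,
    legacy_state: str,
    build_args: Callable[[], dict],
    next_state: Callable[[dict], str],
) -> ValidationReport:
    """Compare the new service against the recorded legacy exchange for one request."""
    # 1. The new service builds its own PSP arguments; they are compared, never sent.
    mismatches = diff_arguments(legacy_args, build_args())
    # 2. The cached PSP response is replayed instead of calling the PSP a second time.
    new_state = next_state(cached_response)
    return ValidationReport(request_uuid, mismatches, legacy_state, new_state)
```

A report that fails on arguments but passes on state, or the reverse, points at different kinds of bugs, which is why the sketch keeps the two results separate rather than collapsing them into one boolean.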

After running the validation for a few days, we found dozens of argument mismatches across different PSPs and edge cases for state transition. Through multiple iterations, we fixed the validation errors and state transition errors and kept the validation running until we migrated everything to the new system. This validation step gave us confidence and proof that the new service could handle the production traffic correctly without any regressions.

Roll out

Once we were confident in our validation metrics, we were finally ready to put the new system into our production flow. First, we built A/B dashboards to monitor business-level metrics of the new system as compared to the legacy system. We checked these dashboards every day. We also decided that it would be best to roll out one payment method at a time to account for additional differences.

For each payment method, we began with a test plan in which a couple of employees tested known success and failure cases. We then moved all Uber employees onto the new system, followed by incremental cohorts of real-world users, going from 0.25 percent to 1 percent to 5 percent and onward until we were 100 percent rolled out.
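
One common way to implement this kind of incremental cohort is deterministic bucketing, sketched below. It is not necessarily the mechanism used for this migration: the hashing scheme, the 10,000-bucket granularity, and the function names are assumptions. Hashing the user together with the payment method gives each payment method its own independent cohort, and a user admitted at a low percentage stays admitted as the percentage grows.

```python
import hashlib

BUCKETS = 10_000  # hundredths of a percent, so 0.25 percent maps to 25 buckets


def bucket(user_id: str, payment_method: str) -> int:
    """Map a (user, payment method) pair to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{payment_method}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def uses_new_system(
    user_id: str,
    payment_method: str,
    rollout_percent: float,
    is_employee: bool = False,
) -> bool:
    """Decide whether this request should be served by the new system."""
    if is_employee:
        # Assumes the all-employee stage is already active for this payment method.
        return True
    threshold = round(rollout_percent * BUCKETS / 100)
    return bucket(user_id, payment_method) < threshold
```

For example, raising rollout_percent from 0.25 to 1 to 5 only ever adds users to the new system, so nobody flips back and forth between the legacy and new paths while the rollout advances.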

Lessons learned

Migrations are a fact of life at Uber and, we imagine, most technology companies. Every tech stack is different, but the lessons we have learned should be generally applicable, especially when downtime is not an option. Our key learnings include:

  • Tech debt: Push the team to fund any technical debt incurred over the years. Paying it down helps the team move faster in the future and improves developer productivity.
  • Trial and error: Don’t expect to get the validation right the first time. Plan for multiple iterations.
  • Data analyst: Having a data analyst as part of the team while doing any migration will help in finding issues early, especially in the payments world.
  • Finish fast: Final migration should be quick, with no opportunity to turn back. Options to roll back to the legacy system will likely be misused, preventing the migration from ever completing.
  • Migrations are a long tail: For consumers to adopt a new data model, all data needs to be available. We needed to finish development before other teams at Uber could begin their work.

If working on systems in a dynamic production environment that has real-world impact interests you, consider applying for a role on our team!

Lead image of orcas by Skeeze from Pixabay.
