By Oscar Wang
With over 700 locations worldwide, Uber’s Greenlight Hubs (GLH) provide in-person support for driver-partners for everything from account and payment issues to vehicle inspections and driver onboarding. To create better experiences for driver-partners and improve customer satisfaction, Uber’s Customer Obsession Engineering team built an in-house customer support system, a solution that has led to more streamlined and quicker support ticket resolution through GLHs.
This customer support system consists of two main features: a check-in queuing system for our service experts to keep track of partners coming into GLHs and an appointment system that lets partners schedule in-person support appointments via the Uber Partner app. Launched in March 2017, these tools have improved support experiences for partners across the globe.
Transitioning to an in-house solution
As Uber grew, our previous customer support technologies could not scale to provide the best experience for our partners. By building our own GLH customer support system, we created a solution that was both tailored to our needs for scalability and customization, and improved our existing infrastructure to support new features.
Developing our own tools meant we could facilitate:
- Easy access to information for customer support: Our check-in system makes it easy for support representatives to access the appropriate information needed to address partner concerns. This integration helps decrease support resolution times and improve partner experiences with GLHs.
- Aggregation of partner communication channels: Centralization for Uber’s various support channels, including in-app messages, the GLH itself, and phone support, means GLH experts have additional context to help resolve partner issues all in one place.
- Shorter GLH wait times for partners: With our updated system, partners can schedule appointments to avoid unnecessary wait times during peak hours.
To meet these goals, we built two new tools for our in-house customer support platform: check-in queueing and an appointment system.
A more seamless check-in experience
We created a more seamless support experience for partners by designing and implementing our own real-time check-in system on top of our customer support platform. Using this system, partners check in with a concierge who then finds the partner’s profile based on the phone number or email address associated with their account.
Once a partner checks in, a GLH expert selects them from the site’s queue. The partner then receives a push notification on their phone, as well as a notification on monitors within the GLH, that they have been paired with an expert. Once a partner meets with their expert at the support station specified in their notification, the partner drops off the check-in queue.
Our real-time check-in system also aggregates customer information, such as past trips and support messages, equipping our experts to solve the issue at hand as effectively as possible.
Providing real-time expert queues
Creating this real-time check-in solution came with some difficulties. One challenge we faced was preventing expert collisions, a scenario in which experts claim partners who are already being assisted. To prevent collisions, our system needed to provide a queue of partners waiting for support (referred to as our GLH Site Queue), through which experts can be paired with waiting partners, and to notify each partner in real time when they have been selected.
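As a rough illustration of collision prevention, the sketch below guards an in-memory queue with a mutex so that checking whether a partner is free and assigning them to an expert happen as one step. The SiteQueue type, the Claim method, and the IDs are hypothetical names chosen for this example, not our production code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrAlreadyClaimed is returned when another expert got to the partner first.
var ErrAlreadyClaimed = errors.New("partner already claimed")

// SiteQueue is a hypothetical in-memory queue for a single GLH site.
type SiteQueue struct {
	mu      sync.Mutex
	claimed map[string]string // partner ID -> expert ID
}

// NewSiteQueue returns an empty queue.
func NewSiteQueue() *SiteQueue {
	return &SiteQueue{claimed: make(map[string]string)}
}

// Claim pairs an expert with a partner. The check and the write happen under
// one lock, so two concurrent claims for the same partner cannot both succeed.
func (q *SiteQueue) Claim(partnerID, expertID string) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, taken := q.claimed[partnerID]; taken {
		return ErrAlreadyClaimed
	}
	q.claimed[partnerID] = expertID
	return nil
}

func main() {
	q := NewSiteQueue()
	fmt.Println(q.Claim("partner-1", "expert-a")) // first claim succeeds
	fmt.Println(q.Claim("partner-1", "expert-b")) // second claim is rejected
}
```

A lock like this only helps if every claim for a given site is handled by the same process.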
The WebSocket protocol allows persistent connections with low latency, so we leveraged it to send queue updates via our backend. Go, our language of choice for many of Uber’s back-end services, made this easy by letting us use channels and goroutines to pass real-time updates to web clients.
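To give a sense of how channels and goroutines fit this job, here is a minimal fan-out sketch. It leaves out the WebSocket layer entirely: each subscriber channel stands in for one connected client, and the Hub type and its methods are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans out queue updates for one site to every subscribed client.
type Hub struct {
	mu   sync.Mutex
	subs map[chan string]struct{}
}

// NewHub returns a hub with no subscribers.
func NewHub() *Hub { return &Hub{subs: make(map[chan string]struct{})} }

// Subscribe registers a client and returns the channel it should read from.
func (h *Hub) Subscribe() chan string {
	ch := make(chan string, 8) // small buffer so a slow client does not block others
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// Unsubscribe removes a client and closes its channel.
func (h *Hub) Unsubscribe(ch chan string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, ok := h.subs[ch]; ok {
		delete(h.subs, ch)
		close(ch)
	}
}

// Broadcast sends an update to every subscriber, skipping any client whose
// buffer is full instead of blocking the whole site.
func (h *Hub) Broadcast(update string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- update:
		default:
		}
	}
}

func main() {
	hub := NewHub()
	var wg sync.WaitGroup
	var chans []chan string
	for i := 1; i <= 2; i++ {
		ch := hub.Subscribe()
		chans = append(chans, ch)
		wg.Add(1)
		// One goroutine per client; a real service would write to a WebSocket here.
		go func(id int, ch chan string) {
			defer wg.Done()
			for u := range ch {
				fmt.Printf("client %d got: %s\n", id, u)
			}
		}(i, ch)
	}
	hub.Broadcast("partner-1 checked in")
	for _, ch := range chans {
		hub.Unsubscribe(ch)
	}
	wg.Wait()
}
```

Dropping updates for slow clients is one possible policy; disconnecting them or sending a full queue snapshot on reconnect are alternatives.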
Nevertheless, our use of WebSocket introduced some interesting routing challenges. For our site queue to work in real time, we decided to keep all of our WebSocket connections and queue writes for a specific site on a single host. This way, when one of the check-ins or appointments in the queue is updated, all of the relevant connected clients are updated too. Using a single host to handle these requests required sharding on the application layer prior to our writes and connecting WebSocket to a host.
We used Ringpop-go, our open source, scalable, and fault-tolerant application-layer sharding library for Go applications, which let us configure sharding keys so that all requests with the same key are routed to the same host. For our sharding key, we used the GLH Site ID, so all check-ins happening at the same GLH go to the same host and update the site queue on all relevant clients.
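The idea of routing by sharding key can be shown without Ringpop itself. The sketch below uses rendezvous (highest-random-weight) hashing as a stand-in: it is not Ringpop's API or its ring implementation, and the ownerFor function and host names are assumptions made for illustration. Every caller that sees the same host list picks the same owner for a given site ID.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownerFor picks the host responsible for a GLH site ID. Each host is scored
// by hashing it together with the key, and the highest score wins, so the
// result does not depend on the order of the host list.
func ownerFor(siteID string, hosts []string) string {
	var best string
	var bestScore uint64
	for _, host := range hosts {
		hsh := fnv.New64a()
		hsh.Write([]byte(host + "|" + siteID))
		if score := hsh.Sum64(); best == "" || score > bestScore {
			best, bestScore = host, score
		}
	}
	return best
}

func main() {
	hosts := []string{"host-a", "host-b", "host-c"}
	for _, site := range []string{"glh-sf", "glh-nyc", "glh-london"} {
		fmt.Printf("%s -> %s\n", site, ownerFor(site, hosts))
	}
}
```

With this scheme, removing a host only moves the sites that host owned; every other site keeps its owner.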
Achieving high reliability across data centers
To ensure our GLH software runs smoothly, we need to maintain high availability. To make this happen, our service runs across multiple data centers, handling requests from all over the globe. If one data center goes down unexpectedly, the service recovers and continues to operate from the other data centers.
Given our use of WebSocket, running the service in multiple data centers came with its own set of difficulties. We had to reconsider how to handle WebSocket gracefully in the event of a data center failure. While Ringpop sharding works well across data centers, a single cross-data center ring would increase latency, because cross-data center requests are sent every time hosts leave or enter the ring.
To address WebSocket degradation, we configured our system so there is a ring in each data center; this way, if two requests with the same unique GLH ID hit two different data centers, only the site queues in the data center that hosts them are updated. We forward all our requests to a single data center, regardless of which data center the request came from. In the event of a data center failover, we forward the requests to another data center. We also kill all the WebSocket connections with the original data center and re-create connections with the new active data center.
To decrease wait times at GLHs and ensure that we have ample support during peak hours, we rolled out a new feature that lets our partners schedule a GLH appointment in advance with just a few quick taps on the Uber Partner app.
Even though appointment scheduling is simple for partners, there is a lot going on behind the scenes to make the process as seamless as possible. For example, GLH managers can specify how many experts are working at their hub at any time to ensure that their teams are not overbooked; then, when a partner goes into the app, they only see availability based on expert capacity. For instance, if only four experts are working at a given GLH at 9 a.m. on Tuesday morning, then the hub’s manager could set a capacity of four appointments at that time, limiting the number of available appointments.
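The capacity rule can be sketched in a few lines. The Slot type and the availableSlots and book functions below are hypothetical and only illustrate the idea that a slot disappears from the app once its bookings reach the manager's capacity.

```go
package main

import "fmt"

// Slot is a hypothetical bookable time slot at one GLH.
type Slot struct {
	Start    string // e.g. "Tue 09:00", in the GLH's local time
	Capacity int    // experts the manager has scheduled for this slot
	Booked   int    // appointments already made
}

// availableSlots returns only the slots a partner should see in the app.
func availableSlots(slots []Slot) []Slot {
	var open []Slot
	for _, s := range slots {
		if s.Booked < s.Capacity {
			open = append(open, s)
		}
	}
	return open
}

// book reserves one appointment, refusing once the slot is at capacity.
func book(s *Slot) bool {
	if s.Booked >= s.Capacity {
		return false
	}
	s.Booked++
	return true
}

func main() {
	tuesday9 := Slot{Start: "Tue 09:00", Capacity: 4}
	for i := 1; i <= 5; i++ {
		// The first four bookings are accepted; the fifth is refused.
		fmt.Printf("booking %d accepted: %v\n", i, book(&tuesday9))
	}
	fmt.Println("slots still visible:", len(availableSlots([]Slot{tuesday9})))
}
```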
When a partner schedules an appointment, they show up under the GLH’s list of appointments for the day. When the partner arrives for their scheduled appointment, they can easily check in through their app, notifying the assigned expert of their arrival. Building our appointment system included implementing a scheduling system on the backend, adding appointment capabilities on mobile, and developing a browser-based calendar interface for our GLH managers.
Building a global scheduling system
Inspired by Martin Fowler’s paper on recurring calendar events, we decided to build our scheduling system around a core calendar service. Specifically, we implemented available time intervals (calendar intervals, for simplicity) that the system treats as rules describing when, and with how much expert capacity, appointments can be booked.
In Fowler’s model, these rules can be specified and modified by GLH managers, thereby allowing for more flexible scheduling. Because scheduling systems in general have many edge cases to consider, we built it incrementally to avoid scope creep and provide a functioning system at every step:
- Our first iteration used the business hours initially set by GLH managers and specified a global capacity of three experts for each site, allowing us to slowly roll out a beta version of the software.
- Our second iteration used calendar intervals set by GLH managers, allowing them to customize expert capacity per interval.
- Our third iteration incorporated the existing calendar intervals but also allowed GLH managers to set times when the GLH was closed (i.e., non-business hours and holidays).
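One way to picture the second and third iterations is as a small rule evaluator: weekly intervals carry a capacity, and closures override them. The sketch below is an illustration under assumptions, not our calendar service; the Interval and Closure types, the capacityAt function, and the example rules are all hypothetical, and time zones are deliberately ignored here.

```go
package main

import (
	"fmt"
	"time"
)

// Interval is a recurring weekly rule: on Day, from StartHour (inclusive) to
// EndHour (exclusive), Capacity experts are available. Hours are site-local.
type Interval struct {
	Day                time.Weekday
	StartHour, EndHour int
	Capacity           int
}

// Closure is a one-off period [From, To) during which the site is closed.
type Closure struct{ From, To time.Time }

// at parses a site-local wall-clock time; time zones are ignored in this sketch.
func at(s string) time.Time {
	t, err := time.Parse("2006-01-02 15:04", s)
	if err != nil {
		panic(err)
	}
	return t
}

// capacityAt applies the rules in order: closures win, then weekly intervals,
// and any time not covered by a rule has no capacity.
func capacityAt(t time.Time, intervals []Interval, closures []Closure) int {
	for _, c := range closures {
		if !t.Before(c.From) && t.Before(c.To) {
			return 0
		}
	}
	for _, iv := range intervals {
		if t.Weekday() == iv.Day && t.Hour() >= iv.StartHour && t.Hour() < iv.EndHour {
			return iv.Capacity
		}
	}
	return 0
}

// Hypothetical rules for one site.
var (
	exampleIntervals = []Interval{
		{Day: time.Monday, StartHour: 9, EndHour: 17, Capacity: 4},
		{Day: time.Thursday, StartHour: 9, EndHour: 17, Capacity: 3},
	}
	exampleClosures = []Closure{
		{From: at("2017-11-23 00:00"), To: at("2017-11-24 00:00")}, // holiday
	}
)

func main() {
	for _, s := range []string{"2017-11-20 09:00", "2017-11-20 18:00", "2017-11-23 10:00", "2017-11-30 10:00"} {
		fmt.Println(s, "capacity:", capacityAt(at(s), exampleIntervals, exampleClosures))
	}
}
```

The four sample times yield capacities of 4, 0, 0, and 3: a Monday morning inside an interval, a Monday evening outside it, a Thursday holiday, and an ordinary Thursday.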
However, due to Uber’s international presence, we quickly started running into time zone-related issues, exacerbated by the fact that various components of the system needed to coordinate which time zone context was being used, e.g., the GLH’s time zone or the partner’s time zone. In addition, we needed to account for daylight saving time changes. To address these demands, we adopted the following set of rules:
- All clients interacting with the main back-end service API assume the time zone of their chosen GLH.
- All appointment times are persisted in our database in UTC, also known as Zulu time, i.e., Z (UTC+0).
- The main back-end service has an internal layer which handles all time zone conversions between the persistent and API layers. This allows us to abstract out calendar logic and call internal calendar-related methods without worrying about time zone issues.
It is important to note that the time zone, i.e., the UTC offset, is not stored as a property of a GLH object. If it were, a daylight saving time change would cause previously scheduled appointment times to shift an hour in either direction. To handle this properly, UTC offsets are dynamically calculated from each GLH’s physical coordinates.
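To see why a stored offset goes stale, the sketch below computes the offset for the same site on two dates using Go's standard time package. It assumes the GLH's coordinates have already been resolved to an IANA zone name such as America/Los_Angeles; that lookup is not shown, and the offsetHours function is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the IANA database so LoadLocation works on any host
)

// offsetHours returns the UTC offset, in whole hours, that applies in the
// named zone at noon on the given local date.
func offsetHours(zone string, year, month, day int) int {
	loc, err := time.LoadLocation(zone)
	if err != nil {
		panic(err)
	}
	_, offsetSeconds := time.Date(year, time.Month(month), day, 12, 0, 0, 0, loc).Zone()
	return offsetSeconds / 3600
}

func main() {
	fmt.Println(offsetHours("America/Los_Angeles", 2017, 7, 1))  // -7 (daylight time)
	fmt.Println(offsetHours("America/Los_Angeles", 2017, 12, 1)) // -8 (standard time)
}
```

A site that had stored -7 in July would display its December appointments an hour off, which is the shift described above.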
Time zone edge cases
While building our scheduling system, we ran into a couple of interesting edge cases regarding time zones. One issue arose when our system converted “calendar intervals” to a local time zone. Due to the time zone change between UTC and local time (depending on the time zone of the site in question), the date might be incorrect. For example, 5:00 a.m. UTC on November 20 is actually 9:00 p.m. PST on November 19. As a result, it was important for us not to make assumptions about the date of the relevant time slots and to test cases when time zone shifts cross days.
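The date shift in this example is easy to reproduce. The following sketch converts a stored UTC timestamp to site-local time with Go's standard time package; the toSiteLocal function is a hypothetical helper, and the zone name is assumed to be known for the site.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the IANA database so LoadLocation works on any host
)

// toSiteLocal converts a stored UTC timestamp (RFC 3339) to the site's
// wall-clock time, including the local calendar date.
func toSiteLocal(utc, zone string) (string, error) {
	t, err := time.Parse(time.RFC3339, utc)
	if err != nil {
		return "", err
	}
	loc, err := time.LoadLocation(zone)
	if err != nil {
		return "", err
	}
	return t.In(loc).Format("2006-01-02 15:04 MST"), nil
}

func main() {
	local, err := toSiteLocal("2017-11-20T05:00:00Z", "America/Los_Angeles")
	if err != nil {
		panic(err)
	}
	fmt.Println(local) // 2017-11-19 21:00 PST
}
```

Any code that groups time slots by day should take the date from the converted value rather than from the stored UTC timestamp.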
In addition, when converting our GLH business hours from UTC time to local time, we had similar time zone issues. We saved our business hours in local time because, without a date, there was not enough context for us to save it in UTC. For example, a GLH that is open from 9:00 a.m. to 9:00 p.m. on Monday in PST would result in UTC business hours that start at 5:00 p.m. Monday and end at 5:00 a.m. on Tuesday morning. Since there is no date, it is unclear what day of the week these hours refer to in local time. As a result, we had to convert the open hours from the stored local hours to UTC hours whenever new calendar intervals were created. Depending on where business logic resides, these scenarios might need to be tested extensively on web and mobile clients, as well as server side.
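A minimal sketch of that conversion follows: business hours stay in local time, and a concrete calendar date supplies the missing context when a UTC interval is needed. The intervalUTC function and its parameters are hypothetical, and the site's zone name is assumed to be known.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the IANA database so LoadLocation works on any host
)

// intervalUTC turns "open from openHour to closeHour, site-local" into a
// concrete UTC interval for one calendar date (formatted YYYY-MM-DD).
func intervalUTC(zone, date string, openHour, closeHour int) (string, string, error) {
	loc, err := time.LoadLocation(zone)
	if err != nil {
		return "", "", err
	}
	day, err := time.ParseInLocation("2006-01-02", date, loc)
	if err != nil {
		return "", "", err
	}
	const layout = "Mon 2006-01-02 15:04 MST"
	start := time.Date(day.Year(), day.Month(), day.Day(), openHour, 0, 0, 0, loc)
	end := time.Date(day.Year(), day.Month(), day.Day(), closeHour, 0, 0, 0, loc)
	return start.UTC().Format(layout), end.UTC().Format(layout), nil
}

func main() {
	start, end, err := intervalUTC("America/Los_Angeles", "2017-11-20", 9, 21)
	if err != nil {
		panic(err)
	}
	fmt.Println(start) // Mon 2017-11-20 17:00 UTC
	fmt.Println(end)   // Tue 2017-11-21 05:00 UTC
}
```

Because the date is part of the input, the same stored hours produce a different UTC interval in summer, when the local offset changes.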
Working with datetime libraries on mobile
For partners to actually use our scheduling system, we needed to build a new UX for mobile. This involved modifying the support form screen to give partners an option besides a submit button to get help, as well as the help home screen to show any upcoming appointments they might have.
There were also a few new screens that correlated to specific activities: picking which nearby GLH to book an appointment, selecting a particular date and time for an appointment given the available options for that site, confirming selections to create an appointment, viewing details of a booked appointment with the option to cancel it, and viewing details about the site, like its address.
Since we were dealing with dates and times, and because we wanted our server API to return structured data (e.g., ISO 8601) rather than pre-formatted localized strings (i.e., dates in the user’s language of preference) for us to display, we assumed we would make use of the standard java.util.Date class. However, Date and the corresponding Calendar classes have a number of known issues when dealing with time zones, so we wanted to explore whether other options might work better for us. For instance, the java.time API introduced in Java 8 sounded interesting, but it was not yet available on Android, a popular OS for partner devices.
We eventually found ThreeTenBP, a backport from the author of Joda-Time that brings the Java 8 time and date APIs to Java 6 and 7. However, prior attempts to use ThreeTenBP on Android suffered from start-up issues. On start-up, the libraries loaded the time zone database information from disk, parsed it, and registered it with the library to use later. Android-specific wrappers of this library loaded the data in a more Android-friendly way, but there were still non-trivial disk operations blocking app start-up. When testing on a low- to mid-range device, this slowed down start-up of the Uber Partner app by over 200 milliseconds.
We tried optimizing ThreeTenBP in multiple ways, for example, by performing the actual disk operations on a different thread so that the rest of Application.onCreate could happen in parallel and join threads at the end, thereby ensuring that the Uber Partner app could safely use the library. We also tried using other similar libraries which attempt to do little or no I/O at start, but could not get the start time penalty down to a reasonable latency.
We tried using a method profiler and, to our surprise, a significant amount of start-up time was spent in string methods commonly seen in parsing code, like String.split. According to our reading of the code, and even our step debugger from Application.onCreate, there did not seem to be any way this could be happening. In the profiler, the heavyweight operations rolled up to the static initializer in the ZoneRulesProvider class, where the (theoretically) lazy time zone database provider code was being registered. Since this class was being loaded to do the registration, even if the object being registered was completely lazy and not doing any I/O on registration, the static initializer block was being run in an attempt to load the time zone database from ServiceLoader/META-INF. This is a pattern typically seen in Java servers, not on Android, and it uses the same resource loading that we were trying to avoid because of its slow performance on Android.
We ended up modifying ThreeTenBP itself so that the behavior of this static initializer block can be easily overridden. The default implementation would remain the same but be abstracted behind a new ZoneRulesInitializer class. An Android app or library would be able to provide its own implementation that would load the time zone database through Android assets on the first usage of the library.
We updated lazythreetenbp, another ThreeTenBP wrapper intended for Android, to take advantage of this new interface, as the equivalent update to ThreeTenABP is pending. Now, the start-up penalty of using this library is zero. Moreover, because the time zone database is loaded lazily rather than in the static initializer block, nothing needs to happen until time zone data is required, which may not even occur during a typical user session. (Uber apps are very large, and few features need to manipulate dates and times in a way that requires time zones.)
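The difference between eager and lazy loading is not specific to Java. As an analogy only, the Go sketch below defers an expensive load until first use with sync.Once; it is not the ThreeTenBP or lazythreetenbp code, and the zoneDB type, the loadZoneDB function, and its contents are hypothetical stand-ins for a parsed time zone database.

```go
package main

import (
	"fmt"
	"sync"
)

// loads counts how many times the expensive load ran, for demonstration.
var loads int

// zoneDB stands in for a parsed time zone database.
type zoneDB map[string]int

var (
	once sync.Once
	db   zoneDB
)

// loadZoneDB stands in for the slow disk read and parse.
func loadZoneDB() zoneDB {
	loads++
	return zoneDB{"America/Los_Angeles": -8, "UTC": 0}
}

// standardOffset looks up a zone's standard UTC offset in hours. The database
// is loaded on the first call only; later calls reuse it.
func standardOffset(zone string) (int, bool) {
	once.Do(func() { db = loadZoneDB() })
	off, ok := db[zone]
	return off, ok
}

func main() {
	fmt.Println("loads at start-up:", loads) // 0
	standardOffset("UTC")
	standardOffset("America/Los_Angeles")
	fmt.Println("loads after two lookups:", loads) // 1
}
```

A session that never calls standardOffset never pays for the load, which mirrors the benefit described above.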
We also built a calendar app for GLH managers to easily and flexibly configure their sites’ business hours, available hours for appointments, and available experts during any given hour. Available hours can be created only during business hours. Closed hours in the calendar are greyed out. The calendar also displays currently scheduled appointments.
In the week view of the calendar, site managers can drag and drop from the start time to the end time to create available hours. They can also add closures such as holidays and lunch times in the calendar, which prevents them from accidentally adding available hours during site closures.
In our beta version of the software, a number of elements in the calendar were re-rendered on every drag so that the hour range could be displayed dynamically, even though most of these elements had no visual updates. Because the calendar renders many DOM elements, we leveraged React’s Virtual DOM and tweaked the shouldComponentUpdate() lifecycle method to reduce the number of elements requiring rendering.
We then used the drag source of react-dnd to check whether each element in the calendar falls within the range between the start time and end time, and re-rendered only those elements. Additionally, because closures and available hours do not allow overlaps, we made their DOM elements non-updatable, which slightly improved performance. As a result, the 200-millisecond delay caused by updating during the drag-and-drop process was reduced to nearly zero.
The future of in-person support engineering at Uber
Building this product helped improve our partners’ experiences at GLHs, leading to higher customer satisfaction. Transitioning to our new system has reduced the average wait time for walk-ins by over 15 percent and, once matched with a customer support expert, issue resolution time for partners has decreased by 25 percent. Best of all, these new features have made GLH wait times nearly non-existent for partners who schedule their appointments!
This is just a glimpse of what we have in store for our partners and customer support experts around the world. We are continuing to explore new technologies to improve the GLH experiences of our users, from revamping our analytics to proactively providing support before partners even file a ticket.