By Marissa Alvarado-Lima, Stanley Chan, Chris Duarte, & Ed Wolf
Two years ago, Uber’s previous chat application began showing signs that it would not be able to adapt to our growth. There were app crashes, performance hiccups, and outages that crippled our company’s ability to effectively communicate online. With user satisfaction at an all-time low, we needed a new solution.
With operations in over 620 cities, it was paramount for us to identify a chat solution that would enable Uber employees to reliably communicate on desktop and mobile regardless of where they were in the world. To accomplish this, we established a few core requirements. To start, we needed something that could scale to support our growing employee population and, as a byproduct, control costs. We also needed a platform that could easily integrate with a variety of internal engineering, business, and operational tools.
While we evaluated Internet Relay Chat (IRC) and many other popular chat clients, it became clear that there was no turnkey third-party solution able to meet Uber’s core requirements.
So after testing multiple off-the-shelf alternatives, we built uChat, our custom in-house messaging platform, by leveraging open source platform Mattermost and Puppet, the Uber standard for deployment configuration management. In this article, we discuss how in just three months our team transitioned the company to a new solution capable of reliably delivering over one million messages per day to tens of thousands of users, all in one unified chat environment.
Going open source
To expedite the development process, we decided to build our solution on top of an open source platform. This option was particularly appealing given our desire to directly influence roadmaps and control long-term recurring costs, which would have been more difficult if we had started from scratch.
As part of our performance validation, we pushed each open source product to its breaking point. For example, one of our initial goals was to simulate the chat behaviors of 50,000 users in a single chat environment. Right away, many of these platforms failed these basic load tests.
With the remaining options, we wrote a test harness to simulate thousands of users. The test suite’s design (Figure 1a, above) was extremely simple; this ensured that product claims could be validated with minimal effort on our end. In this testing environment, the simple Go struct EntityConfig (Figure 1b, below) encapsulated a user that could log in, join channels, and send messages. Combined with Go’s native goroutines, this struct made it easy to hit our simulation goals. Over time, the test harness grew to offer greater functionality and better simulate complex user scenarios.
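As a rough illustration of the harness idea described above, the following Go sketch runs one goroutine per simulated user. Only the name EntityConfig comes from the article; its fields, the ChatClient interface, the in-memory countingClient, and the Simulate function are hypothetical stand-ins, and no real chat server is contacted.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ChatClient is a hypothetical interface standing in for a real chat API client.
type ChatClient interface {
	Login(username, password string) error
	JoinChannel(channel string) error
	SendMessage(channel, text string) error
}

// EntityConfig describes one simulated user. The fields are assumptions;
// the article only names the struct.
type EntityConfig struct {
	Username string
	Password string
	Channels []string
	Messages int // messages to send in each channel
}

// Run logs the user in, joins each channel, and sends the configured messages.
func (e EntityConfig) Run(c ChatClient) error {
	if err := c.Login(e.Username, e.Password); err != nil {
		return fmt.Errorf("login %s: %w", e.Username, err)
	}
	for _, ch := range e.Channels {
		if err := c.JoinChannel(ch); err != nil {
			return fmt.Errorf("join %s: %w", ch, err)
		}
		for i := 0; i < e.Messages; i++ {
			text := fmt.Sprintf("message %d from %s", i, e.Username)
			if err := c.SendMessage(ch, text); err != nil {
				return fmt.Errorf("send to %s: %w", ch, err)
			}
		}
	}
	return nil
}

// countingClient is an in-memory fake that only counts sent messages.
type countingClient struct{ sent int64 }

func (c *countingClient) Login(string, string) error { return nil }
func (c *countingClient) JoinChannel(string) error   { return nil }
func (c *countingClient) SendMessage(string, string) error {
	atomic.AddInt64(&c.sent, 1)
	return nil
}

// Simulate runs every entity in its own goroutine, waits for all of them,
// and returns the number of entities that hit an error.
func Simulate(entities []EntityConfig, c ChatClient) int64 {
	var wg sync.WaitGroup
	var failures int64
	for _, e := range entities {
		wg.Add(1)
		go func(e EntityConfig) {
			defer wg.Done()
			if err := e.Run(c); err != nil {
				atomic.AddInt64(&failures, 1)
			}
		}(e)
	}
	wg.Wait()
	return failures
}

func main() {
	entities := make([]EntityConfig, 1000)
	for i := range entities {
		entities[i] = EntityConfig{
			Username: fmt.Sprintf("user%d", i),
			Password: "placeholder",
			Channels: []string{"town-square", "off-topic"},
			Messages: 3,
		}
	}
	client := &countingClient{}
	failures := Simulate(entities, client)
	fmt.Printf("sent=%d failures=%d\n", atomic.LoadInt64(&client.sent), failures)
}
```

Keeping the client behind an interface means the same simulated users could be pointed at a fake for dry runs or at a real server for load tests.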
After months of testing and vendor vetting, we zeroed in on Mattermost. Mattermost exceeded our minimum performance expectations and aligned nicely with Uber’s existing tech stack. Furthermore, their client user interfaces were similar to those of chat applications popular among Uber employees.
The next logical step was to assess whether or not the platform could both keep up with Uber’s hypergrowth and sustain our operational dependency on chat. After careful consideration of the necessary factors, we asked ourselves: could Mattermost reach Uber-scale?
To test Mattermost’s ability to adapt to these conditions, our load tests sought to identify major bottlenecks in architecture and server code that might impact performance. We simulated variable message send rates, ramp up windows, and concurrent user spikes to establish a baseline. Thousands of fake user accounts were created and programmed to log in, join rooms, and chat furiously with the goal of triggering crashes and database locking.
During testing, we were unsurprised to find that it was not just the back end that exhibited limited scalability. Most of the user interfaces we tested could not meet our aggressive targets: for example, chat rooms could only house one hundred or so occupants, and existing user interfaces were unable to search over 20,000 channels simultaneously.
As we encountered bottlenecks during our scale tests, we targeted fixes for the most impactful performance offenders and repeated the process from known stable load levels. At the same time, we continued to update our tests to mimic increasingly realistic use cases. Our initial target was 70,000 concurrent users with a send rate of 80 to 200 messages per second. If this load was managed successfully, we could ensure an ample runway for future growth.
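To make the load target concrete, the sketch below works through the arithmetic implied by the figures above: how often each of 70,000 users would need to send a message to produce 80 or 200 messages per second in aggregate, plus a simple linear ramp-up schedule. The function names, the even per-user distribution, and the seven-step ramp are assumptions for illustration, not a description of the actual harness.

```go
package main

import (
	"fmt"
	"time"
)

// PerUserInterval returns how often each user must send one message so that
// `users` concurrent users together produce `ratePerSec` messages per second,
// assuming traffic is spread evenly across users.
func PerUserInterval(users int, ratePerSec float64) time.Duration {
	seconds := float64(users) / ratePerSec
	return time.Duration(seconds * float64(time.Second))
}

// RampSchedule returns the cumulative number of connected users after each of
// `steps` equal ramp-up steps, ending at `users`.
func RampSchedule(users, steps int) []int {
	schedule := make([]int, steps)
	for i := 1; i <= steps; i++ {
		schedule[i-1] = users * i / steps
	}
	return schedule
}

func main() {
	const users = 70000
	for _, rate := range []float64{80, 200} {
		fmt.Printf("%.0f msg/s across %d users: one message per user every %v\n",
			rate, users, PerUserInterval(users, rate))
	}
	fmt.Println("ramp-up steps:", RampSchedule(users, 7))
}
```

Under these assumptions, the aggregate rate works out to one message per user every 875 seconds at the low end and every 350 seconds at the high end.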
Working with the open source community, we continuously dissected logs, identified root causes, and issued new fixes; turnaround speed was paramount. With each new discovery and build, we developed a deeper sense of the overall system limitations and ability to quickly adapt to scale demands.
Over time, it became clear that Mattermost was most capable of maturing at Uber-scale, and we had gained confidence in their platform. To help others create their own chat platforms using our new partner-in-development, we contributed our test harness work to the open source community.
Managing the uChat architecture with Puppet
While load testing the Mattermost platform, we needed to continuously and reliably make changes to our new solution’s infrastructure and application configurations. Mattermost’s high availability mode was still in early beta, so we often encountered bugs in our code.
During early testing and development stages, the Uber Infrastructure team made these code changes by hand. Stabilizing new uChat builds and server configurations was a predictably tedious and regression-prone process, requiring frequent changes to fix unintentional configuration mistakes. When it was clear that manual infrastructure administration was slowing us down, we turned to Puppet.
Encoding architecture and server configurations with Puppet provided a solid foundation for managing the various server and network environments required to support rapid uChat improvements. Puppet enabled consistent and repeatable changes across the many machines in our database topology and allowed us to audit changes and enforce deployment code reviews.
The initial introduction of Puppet greatly reduced unintentional errors during deployments. However, even with basic Puppet infrastructure-as-code solving some of our repeatability and consistency problems, we still had not achieved the velocity we needed. At this stage, we were not using immutable servers, which meant that changes to the Puppet code for our highly available architecture came with the considerable risk of unintended side effects. Additionally, we were using one large monorepo where all Puppet-related changes had to converge, adding to these complications.
This constraint meant that we could not deploy new infrastructure and application versions quickly enough for our needs, severely limiting our rate of iteration. As a result, updated Puppet deployment configurations might take a day or two to safely complete. Meanwhile, unrelated uChat performance improvements were shipping hourly. Our new Puppet-based workflow was better, but there was still room for improvement.
To increase the uChat deployment speed, we transitioned our infrastructure code to a separate repository as a Puppet module. This would allow us to independently version, isolate, and test uChat application-specific configuration changes, irrespective of server infrastructure configurations.
After refactoring and thoroughly testing our new Puppet module, we relied on Puppet Code Manager to push infrastructure changes to newly minted staging and production environments. This module-based approach allowed us to separate application configuration concerns from infrastructure/server configuration concerns, more easily distributing changes based on the needs and release timeline of the uChat application itself.
With Puppet modules, Puppet Code Manager, and support for A/B deployments, we finally achieved the rapid and consistent deployment change control we needed. Now, the recreation of our entire staging and production environments could be reliably accomplished in less than a day.
Creating a seamless user experience
In parallel to maturing our infrastructure, we also had to build a suite of intuitive, Uber-fied web, desktop, and mobile applications. To give us a head start, we forked the Mattermost desktop applications and began a straightforward white labeling of our internal clients. The Mattermost mobile apps, on the other hand, needed a complete restructure to achieve the type of user experience (UX) we were looking for.
The original iOS and Android uChat apps incorporated a simple webview to load the entire uChat webpage, causing extremely slow load times. Additionally, the webview caused refreshes that forced unnecessary loading, incomplete file downloads, and uneven experiences. To augment our staff and accelerate our build, we partnered with Fullstack Labs to rewrite these apps in React Native and, as we did with our Mattermost-enabled platform, contribute everything we built to the open source community.
Currently, uChat mobile clients are deployed through our internal app store, with the desktop clients being managed via Chef and System Center Configuration Manager (SCCM). To both limit the number of mobile uChat versions in the wild and simplify ongoing support, we built custom mechanisms within the iOS and Android apps that prompt users to upgrade when a new uChat version becomes available. We also engineered the ability to invalidate old client versions in the event that we need to force an upgrade.
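The upgrade prompt and forced-upgrade behavior described above can be thought of as a comparison between the client version and two thresholds. The Go sketch below is one way to express that decision; the three-part version format, the function names, and the action labels are all assumptions for illustration, since the article does not describe how the actual mechanism works.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// UpgradeAction is a hypothetical label for what the client should do.
type UpgradeAction string

const (
	UpToDate      UpgradeAction = "up-to-date"
	PromptUpgrade UpgradeAction = "prompt-upgrade" // newer version available
	ForceUpgrade  UpgradeAction = "force-upgrade"  // client version invalidated
)

// parseVersion turns "MAJOR.MINOR.PATCH" into three integers.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid version %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// compareVersions returns -1, 0, or 1 when a is older than, equal to, or newer than b.
func compareVersions(a, b [3]int) int {
	for i := range a {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// CheckClient decides what a client should do given the oldest version still
// allowed (minSupported) and the newest released version (latest).
func CheckClient(client, minSupported, latest string) (UpgradeAction, error) {
	c, err := parseVersion(client)
	if err != nil {
		return "", err
	}
	m, err := parseVersion(minSupported)
	if err != nil {
		return "", err
	}
	l, err := parseVersion(latest)
	if err != nil {
		return "", err
	}
	switch {
	case compareVersions(c, m) < 0:
		return ForceUpgrade, nil
	case compareVersions(c, l) < 0:
		return PromptUpgrade, nil
	default:
		return UpToDate, nil
	}
}

func main() {
	for _, v := range []string{"1.2.0", "1.10.0", "2.0.0"} {
		action, err := CheckClient(v, "1.3.0", "2.0.0")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("client %s: %s\n", v, action)
	}
}
```

Comparing the parsed integers rather than the raw strings matters here: a string comparison would wrongly treat "1.10.0" as older than "1.3.0".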
Our extensibility-minded vision for the platform required an integrated UX with the other tools in our ecosystem. To accomplish this, we spent several months transitioning over 50 legacy integrations from our previous solution to uChat. APIs and webhooks allowed existing internal services to integrate with chat and extend its capabilities beyond one-on-one conversations. Examples include Uber’s internal deployment system, uDeploy, which notifies engineers when a new build is completed, and Envoy, which supports our office visitor registration services.
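As an illustration of the kind of integration described above, the sketch below posts a build notification to an incoming webhook. It assumes the Mattermost-style incoming webhook payload, a JSON object with a text field; the function names, the message wording, the service and build identifiers, and the local test server standing in for the webhook endpoint are hypothetical. Nothing here reflects how uDeploy actually integrates with uChat.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// webhookPayload is the minimal JSON body assumed for an incoming webhook.
type webhookPayload struct {
	Text string `json:"text"`
}

// BuildPayload renders the notification body for a completed build.
func BuildPayload(service, build string) ([]byte, error) {
	msg := fmt.Sprintf("Build %s of %s completed", build, service)
	return json.Marshal(webhookPayload{Text: msg})
}

// PostBuildNotice sends the notification to hookURL and reports non-200 responses.
func PostBuildNotice(client *http.Client, hookURL, service, build string) error {
	body, err := BuildPayload(service, build)
	if err != nil {
		return err
	}
	resp, err := client.Post(hookURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("webhook returned %s", resp.Status)
	}
	return nil
}

func main() {
	// A local test server stands in for the chat server's webhook endpoint.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b, _ := io.ReadAll(r.Body)
		fmt.Printf("webhook received: %s\n", b)
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	if err := PostBuildNotice(srv.Client(), srv.URL, "example-service", "1234"); err != nil {
		fmt.Println("error:", err)
	}
}
```

In a real deployment, the URL would be the webhook address generated by the chat server rather than a local test server.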
Building trust with users
When implementing any enterprise application, driving user adoption and changing existing behaviors can be challenging. While shifting the company onto a home-grown tool like uChat, extra care was necessary to instill confidence that uChat would be more reliable than the incumbent messaging platform. To establish this confidence and make the transition to uChat as seamless as possible, we pre-provisioned all employees with accounts and migrated nearly 20,000 chat rooms so that employees would not have to recreate or re-join any of the rooms they previously worked in.
Despite our best efforts, there were road bumps. As we integrated more and more employees onto uChat, intermittent outages made some early users reluctant to fully adopt the application. To mitigate concerns, we kept our legacy chat online while uChat stabilized and had more time to establish itself.
To further ingratiate uChat with employees, we doubled down on transparency. We acknowledged bugs as they were discovered, communicated plans for remediation, and deployed patches. We also made monitoring graphs widely available so that anyone interested could see uChat’s real-time health.
Once we could consistently demonstrate 99.9 percent availability with uChat, we deprecated our legacy chat system. The full turndown was accomplished by slowly disabling features such as room creation and integrations until we pulled the plug on the legacy system entirely. We regularly announced product updates, shared tips, and assembled resources like FAQs and user guides to ensure that employees felt fully supported. We also opened a feedback channel where employees could receive immediate assistance and suggest new features.
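For a sense of what a 99.9 percent availability target allows, the sketch below converts an availability fraction into a downtime budget. The function name and the choice of 30-day and 365-day windows are assumptions; the article does not state the measurement window used.

```go
package main

import (
	"fmt"
	"time"
)

// DowntimeBudget returns the total downtime permitted over `days` days at the
// given availability (for example, 0.999), rounded to the nearest second.
func DowntimeBudget(availability float64, days int) time.Duration {
	period := time.Duration(days) * 24 * time.Hour
	budget := time.Duration((1 - availability) * float64(period))
	return budget.Round(time.Second)
}

func main() {
	for _, days := range []int{30, 365} {
		fmt.Printf("99.9%% over %d days allows %v of downtime\n",
			days, DowntimeBudget(0.999, days))
	}
}
```

At 99.9 percent, that is roughly 43 minutes of downtime in a 30-day window, or just under nine hours over a year.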
One key lesson from this experience was that turning down an incumbent chat application required far more planning, organizational alignment, buy-in, and coordination with our users than we initially anticipated. But by remaining transparent, accessible, and quick to deliver improvements, we established a foundation of good rapport with early adopters and, over time, transitioned the entire company.
The future of uChat
In the coming months, we intend to implement more features, continue monitoring user feedback, and incorporate additional improvements offered by Mattermost, Inc. and their open source community.
uChat is just one of the many solutions our team built to foster collaboration, communication, and productivity at Uber. We are constantly developing, deploying, and maintaining Uber’s in-house toolkit, saving our company millions annually in reclaimed productivity and software licensing. If engineering next-gen internal systems for a rapidly growing global company like Uber sounds interesting, consider applying for a role on Uber’s Employee Productivity Tools team.
Marissa Alvarado-Lima is a technical writer on Uber’s Technology Services team, Stanley Chan and Chris Duarte are software engineers on Uber’s Employee Productivity Tools, and Ed Wolf is a senior product manager on Uber’s Employee Productivity Tools team. Software engineers Sameer Patan and Josh Schipper, senior site reliability engineer Matt Beard, engineering manager Benjamin Booth, and product designer Nayong Park also contributed to this article.