How Uber Thinks About Site Reliability Engineering
Uber’s mission is transportation as reliable as running water, everywhere, for everyone. Site reliability engineers (SREs) ensure this reliability, which can make or break any company’s reputation. For Uber, it’s particularly pertinent; people remember even the slightest moment of unreliability, and they talk about it.
This past month, we talked about what it takes to get site reliability engineering right. On Feb. 2, Uber hosted an evening of talks at our San Francisco headquarters covering what SRE means for Uber and what it has meant for Google, an early pioneer in advancing this engineering role over the past decade.
Engineering for the Long Game: Managing System Complexity Over Time
Our keynote speaker was Astrid Atkinson, who described how she has thought about site reliability engineering at Google. Her talk covered Google’s SRE history and the lessons she has learned along the way:
Astrid Atkinson, Google Site Reliability Engineering. In 2004, Astrid joined the original Site Reliability Engineering team at Google, known at the time as the Production Team. As the SRE organization grew, she became the manager of the SRE team responsible for the Google Webservers, where she carried the pager for google.com for more than five years. Astrid then managed the Site Reliability Engineering team for Google’s social products before moving to the Cloud Platform team, where she led the development of the next-generation Google App Engine. Today, Astrid is an Engineering Director, leading some of Google’s core Search Infrastructure teams.
A History of Site Reliability Engineering at Uber
We next heard about Uber’s site reliability engineering beginnings and growth over the past year:
Rick Boone, Uber Site Reliability Engineering. Rick has been an SRE at Uber for just over a year. Before Uber, Rick spent three years at Facebook as a Production Engineer (what Facebook calls SREs). Prior to that, Rick worked at several startups in LA, including BuzzMedia.
Dealing with Scale: Shaping Reality and Repeatability
Next up, two other Uber SREs discussed dealing with hypergrowth and how Uber’s engineering culture has changed as reliability has become a greater priority at scale:
Shella Stephens, Uber Site Reliability Engineering. Shella is an SRE working on Uber’s migration from bare metal to the cloud, and has been instrumental in our datacenter buildouts. Shella has been with Uber for almost two years. Previously, Shella was an infrastructure engineer at LinkedIn where she participated in moving LinkedIn to an all-active architecture.
Tom Croucher, Uber Site Reliability Engineering. Over the past two years at Uber, Tom has served as a manager and as a senior lead engineer, building frameworks our developers use every day. Most recently, Tom led our New Year’s Eve 2015 infrastructure preparation efforts. Tom previously served as CTO of Change.org, in addition to holding engineering roles at several other companies. Tom has co-authored several books including O’Reilly Media’s Up and Running with Node.js.
Observability at Uber Engineering: Past, Present, Future
Finally, we heard about how data and outage detection will have an increasing impact on the SRE role at Uber in 2016 and beyond:
Fran Bell, PhD, Uber Data Science. Fran has been an Uber data scientist for about one and a half years. Before Uber, Fran developed novel approximate quantum dynamics theories for large biomolecules as a postdoc at Caltech. She is cited in over 300 science publications and turned down an assistant professorship at the Johns Hopkins University in Baltimore to join Uber.
Akshay Shah, MD, Uber Data Engineering. Akshay has been a data engineer at Uber for about one year. He previously worked as a full-stack application developer at a handful of SF startups. Before that, he became a medical doctor and worked as a public school teacher before deciding to switch careers.
We have many projects underway on the rigorous road to reliability, and we need contributions from amazing people to get there. If Site Reliability Engineering at Uber interests you, talk to us about the projects we are working on and the opportunities for you to join our efforts toward four nines (99.99% availability).
Chris Adams is a Site Reliability Engineering Manager at Uber. He originally hails from Australia and has held a variety of software engineering roles there and in the San Francisco Bay Area.
Sign up for our Uber Engineering Events Meetup group to learn about future talks about our technologies.