Uber Open Source: Catching Up with Felix Cheung, Data Platform Engineering Manager

It may seem counter-intuitive that open source software, with its public availability and transparent code, can give private enterprise an edge. But that’s the lesson provided by Felix Cheung, an engineering manager at Uber who puts significant effort into The Apache Software Foundation.

Based in Uber’s Seattle Engineering office, Felix works on our Data Platform team, building and maintaining the infrastructure that processes the more than 100 petabytes of data that power Uber’s services around the world. Felix also serves on the Technical Steering Committee for the Uber Open Source Program, which provides technical oversight around open sourcing software and contributing to open source software projects. Open source projects such as Apache Hadoop, Apache Kafka, and Apache Spark play a big part in this infrastructure. Likewise, Uber’s own contributed projects, such as Peloton and Marmaray, make this infrastructure possible, as does Apache Hudi (incubating), software developed at Uber and donated to the Apache Software Foundation.

Along with his work at Uber, Felix serves on the Project Management Committees for Apache Spark, Apache Zeppelin, and the Apache Incubator, and was recently elected as a Member of The Apache Software Foundation. While some of this work overlaps with technologies used by Uber, Felix contributes his time to the betterment of a wider array of software projects, exemplifying the spirit of the open source software movement.

We sat down with Felix to find out about his work and his involvement with open source:

What is your background in engineering, and what was your journey to Uber?

I have been working with Big Data open source projects for about six years. My first venture into Big Data came when I was tasked with improving end-to-end reliability by analyzing logs and call flows for my former company’s infrastructure. After a bit of initial success, I was very fortunate to have several opportunities to do more with the latest open source frameworks, like Apache Spark. Over the years, I continued to improve my skills and got exposed to different perspectives on solving Big Data issues.

My background and history with open source gave me more opportunities, and I could not be more thankful for how I was able to immerse myself and keep up with how the technology evolved. Eventually, the sum of all those experiences led me to Uber, where I’m leading the Core Data team in Seattle, part of our Data Platform organization.

What attracted you to open source engineering?

I love the community. I do most of my work with The Apache Software Foundation (ASF), serving as a Project Management Committee member for Apache Zeppelin, Apache Spark, and the Apache Incubator, and helping with other projects when I have time. In particular, I like how ASF values community and collaboration. ASF folks have sayings like “community over code” and “merit never expires,” which reflect The Apache Way, as it is affectionately known.

Of course, there are many open source communities besides ASF. Over the years, I have been involved to varying degrees with other open source groups. For example, last year I joined the Kubernetes Big Data SIG.

I love seeing how driven people are in the open source community. They are literally spending hours of their own time to help and improve projects for other engineers. Many of these projects are used in world-class technology platforms, and community members are innovating on them at a staggering pace.

It’s great to be able to work with these smart, passionate people from all around the world through open source projects.

In what ways are you currently involved with open source engineering?

Our Data Platform uses a number of ASF projects. Off the top of my head, I can think of 13 projects we use. I think the sheer size of the community and their extensive contributions, along with the process guidelines around releases, gives us some of the quality assurance that we as a tech company depend on.

We actively engage and collaborate with the community to share ideas and experiences. I have organized and participated in a few open source special interest groups. I find this extremely valuable because as much as we love the pace of innovation in various projects, we also highly value reliability, and Uber comes with some unique challenges at its enormous scale.

In my day-to-day work, I am fortunate to lead a talented team of engineers who love working with ASF and other open source projects. They are equally excited about embracing the open source community.

On the personal side, I work with several core projects, nowadays mostly around helping to review pull requests, trying to give good feedback, and merging changes. I mentor a few projects in the Apache Incubator. I also organize local meetups and sometimes go to conferences to talk about my experiences with open source projects.

What has been your experience with the Uber open source community?

Honestly, it’s been amazing. The Uber tech team values open source, whether it involves leveraging existing projects in many areas of our tech stack, sharing our experience, or open sourcing our own projects. Some great examples are Horovod, Pyro, and, more recently, Marmaray, Ludwig, AresDB, and Peloton. There are over 200 projects on our open source GitHub. I also want to give a quick shout-out to jvm-profiler, which my team open sourced last year and is supported by a small but growing community. Oh, and last but not least, Hudi, for getting accepted into the Apache Incubator.

Last year we established an official Uber Open Source program, which has now evolved into the Open Source Program office. I work with them a lot, organizing meetups and events, and serving on the Technical Steering Committee, figuring out the best practices and protocol for sharing our work.

How does Uber integrate open source with its Big Data platform?

A number of ASF projects are particularly useful for us. In many cases, we have to extend these projects to adapt to the needs of our architecture. Where we make fixes or improvements, we try to do it in a way that is not specific to Uber and then upstream these changes after validating them internally. That’s the beauty of the Apache License 2.0: we can make changes to the software so it works for our needs while keeping in mind our responsibility to the larger community.

We also treasure collaboration with other companies through open source projects. I find that conversing over the code in an ASF Git repository or on GitHub makes collaborating as an engineer much easier than it was five years ago.

What are some of your notable contributions to open source projects?

I collaborated with LinkedIn’s Core Data Infrastructure team for a few years. For most of last year, we, along with members of the Apache Spark community, worked to improve the scalability and reliability of Apache Spark, including working on a design proposal to disaggregate data shuffle. We have learned a lot over the past year.

I have also been working with Anne Holler, from Uber’s Machine Learning platform team, to propose a design change in Apache Spark for ML model online serving. Michelangelo, Uber’s machine learning platform, leverages Apache Spark quite heavily for large-scale distributed machine learning model training.

Are there any newer projects you are really excited about?

XGBoost is really cool. It’s a very popular machine learning library for optimized distributed gradient boosting. InfoWorld included it in its Technology of the Year awards for 2019, and it’s the most popular non-deep-learning machine learning library in the industry today.

Aside from ASF projects, XGBoost is another project that my team is investing in. Nan Zhu, a senior engineer on my team, has been working on the project for a few years and created support for distributed pipelines on Apache Spark. He is a committer for the project. As a maintainer, he also actively engages with the community, helping to build the roadmap and shape releases and collaborating with the community to organize meetups, all while still contributing major features and improvements. You can see the footprints of his and his teammate’s contributions in the last two releases, 0.82 and 0.90.

We are making substantial improvements to the framework for Uber’s internal use cases as well. With XGBoost, Uber has pushed the envelope for large-scale distributed training to an industry-leading range of 12 billion records, unlocked distributed training with 15 terabytes of data, and enabled training of 20-layer deep-tree models. Most recently, we designed and implemented a new distributed fast histogram algorithm to significantly speed up XGBoost, added support for multiple validation datasets, and included support for the latest version of Spark.

If contributing to open source software projects and building Big Data infrastructure interests you, consider a role on our team!

Lead photo by Chad Peltola on Unsplash.