Tag: Big Data

Building a Large-scale Transactional Data Lake at Uber Using Apache Hudi

The Apache Hudi team at Uber reflects on the open source project's history as it graduates to a Top Level Project under the Apache Software Foundation.

Monitoring Data Quality at Scale with Statistical Modeling

Uber employs statistical modeling to find anomalies in data and continually monitor data quality.

Designing a Production-Ready Kappa Architecture for Timely Data Stream Processing

We implemented a Kappa architecture at Uber to effectively backfill streaming data at scale, ensuring accurate data in our platform.

Engineering SQL Support on Apache Pinot at Uber

We engineered full SQL support on Apache Pinot to enable quick analysis and reporting on aggregated data, leading to improved experiences on our platform.

Uber’s Data Platform in 2019: Transforming Information to Intelligence

In 2019, Uber's Data Platform team leveraged data science to improve the efficiency of our infrastructure, enabling us to compute optimum datastore and hardware usage.

Productionizing Distributed XGBoost to Train Deep Tree Models with Large Data Sets at Uber

We share technical challenges and lessons learned while productionizing and scaling XGBoost to train distributed gradient-boosted tree models at Uber.

Building a Better Big Data Architecture: Meet Uber’s Presto Team

Uber has embraced Presto, a high-performance, distributed SQL query engine, and joined the Presto Foundation. Meet the Uber engineers who contribute to and use Presto on a daily basis.

Searchable Ground Truth: Querying Uncommon Scenarios in Self-Driving Car Development

When developing Uber's self-driving car systems, engineers found a way to identify edge-case scenarios among terabytes of sensor data representing real-world situations.

Uber Joins LF Presto Foundation to Advance Open Source Analytics

Uber is honored to join the Presto Foundation, a new initiative hosted by the Linux Foundation, to advance the open source data processing community.

Less is More: Engineering Data Warehouse Efficiency with Minimalist Design

Data science helps Uber determine which tables in a database should be off-boarded to another source to maximize the efficiency of our data warehouse.

Making Apache Spark Effortless for All of Uber

Uber engineers created uSCS, a Spark-as-a-Service solution that helps manage Apache Spark jobs throughout large organizations.

Consistent Data Partitioning through Global Indexing for Large Apache Hadoop Tables at Uber

Updating individual records in Uber's Apache Hadoop data lake, which holds over 100 petabytes, required building Global Index, a component that manages data bookkeeping and lookups at scale.

Uber Submits Hudi, an Open Source Big Data Library, to The Apache Software Foundation

We submitted Hudi to the Apache Incubator to ensure the long-term growth and sustainability of the project under The Apache Software Foundation.

Uber Case Study: Choosing the Right HDFS File Format for Your Apache Spark...

Uber's Maps Collection and Reporting (MapCARs) team shares best practices when choosing which HDFS file formats are optimal for use with Apache Spark.

Solving Big Data Challenges with Data Science at Uber

How engineers and data scientists at Uber worked together to devise a way of partially replicating Vertica clusters to better scale our data volume.

Accessible Machine Learning through Data Workflow Management

Uber engineers offer two common use cases showing how we orchestrate machine learning model training in our data workflow engine.

DBEvents: A Standardized Framework for Efficiently Ingesting Data into Uber’s Apache Hadoop Data Lake

Uber engineers discuss the development of DBEvents, a change data capture system designed for high data quality and freshness that is capable of operating on a global scale.

Year in Review: 2018 Highlights from the Uber Engineering Blog

Our editors spotlight some of the year's most popular articles, from an overview of our Big Data platform to a first-person account of an engineer's immigrant journey.

Sessionizing Uber Trips in Real Time

Uber's many data flows required modeling the data associated with a specific task, such as a rider trip, into a state machine. The state machine lets engineers focus on just the events needed to successfully accomplish a trip.

Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads

Uber developed Peloton to help us balance resource use, elastically share resources, and plan for future capacity needs.