Tag: Big Data
Building a Large-scale Transactional Data Lake at Uber Using Apache Hudi
The Apache Hudi team at Uber reflects on the open source project's history as it graduates to a Top Level Project under the Apache Software Foundation.
Monitoring Data Quality at Scale with Statistical Modeling
Uber employs statistical modeling to find anomalies in data and continually monitor data quality.
Designing a Production-Ready Kappa Architecture for Timely Data Stream Processing
We implemented a Kappa architecture at Uber to effectively backfill streaming data at scale, ensuring accurate data in our platform.
Engineering SQL Support on Apache Pinot at Uber
We engineered full SQL support on Apache Pinot to enable quick analysis and reporting on aggregated data, leading to improved experiences on our platform.
Uber’s Data Platform in 2019: Transforming Information to Intelligence
In 2019, Uber's Data Platform team leveraged data science to improve the efficiency of our infrastructure, enabling us to compute optimum datastore and hardware usage.
Productionizing Distributed XGBoost to Train Deep Tree Models with Large Data Sets at Uber
We share technical challenges and lessons learned while productionizing and scaling XGBoost to train distributed gradient-boosted algorithms at Uber.
Building a Better Big Data Architecture: Meet Uber’s Presto Team
Uber has embraced Presto, a high-performance, distributed SQL query engine, and joined the Presto Foundation. Meet the Uber engineers who contribute to and use Presto on a daily basis.
Searchable Ground Truth: Querying Uncommon Scenarios in Self-Driving Car Development
When developing Uber's self-driving car systems, engineers found a way to identify edge case scenarios amongst terabytes of sensor data representing real-world situations.
Uber Joins LF Presto Foundation to Advance Open Source Analytics
Uber is honored to join the Presto Foundation, a new initiative hosted by the Linux Foundation, to advance the open source data processing community.
Less is More: Engineering Data Warehouse Efficiency with Minimalist Design
Data science helps Uber determine which tables in a database should be off-boarded to another source to maximize the efficiency of our data warehouse.
Making Apache Spark Effortless for All of Uber
Uber engineers created uSCS, a Spark-as-a-Service solution that helps manage Apache Spark jobs throughout large organizations.
Consistent Data Partitioning through Global Indexing for Large Apache Hadoop Tables at Uber
Updating individual records in Uber's Apache Hadoop data lake, which holds over 100 petabytes, required building Global Index, a component that manages data bookkeeping and lookups at scale.
Uber Submits Hudi, an Open Source Big Data Library, to The Apache Software Foundation
We submitted Hudi to the Apache Incubator to ensure the long-term growth and sustainability of the project under The Apache Software Foundation.
Uber Case Study: Choosing the Right HDFS File Format for Your Apache Spark...
Uber's Maps Collection and Reporting (MapCARs) team shares best practices for choosing the optimal HDFS file formats for use with Apache Spark.
Solving Big Data Challenges with Data Science at Uber
How engineers and data scientists at Uber worked together to devise a way of partially replicating Vertica clusters to better scale our data volume.
Accessible Machine Learning through Data Workflow Management
Uber engineers offer two common use cases showing how we orchestrate machine learning model training in our data workflow engine.
DBEvents: A Standardized Framework for Efficiently Ingesting Data into Uber’s Apache Hadoop Data Lake
Uber engineers discuss the development of DBEvents, a change data capture system designed for high data quality and freshness that is capable of operating on a global scale.
Year in Review: 2018 Highlights from the Uber Engineering Blog
Our editors spotlight some of the year's most popular articles, from an overview of our Big Data platform to a first-person account of an engineer's immigrant journey.
Sessionizing Uber Trips in Real Time
Uber's many data flows required modeling the data associated with a specific task, such as a rider trip, into a state machine. The state machine lets engineers focus on just the events needed to successfully accomplish a trip.
Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads
Uber developed Peloton to help us balance resource use, elastically share resources, and plan for future capacity needs.