By Mrina Natarajan & Hohyun Shim
Fast, granular, reliable ROI on ad performance was our bugle call to build Euclid, Uber’s in-house marketing platform. Early this year, Euclid replaced a legacy system, which processed ROI data somewhat manually as it struggled to keep up with Uber’s scale and data complexity.
Unlike off-the-shelf solutions, the Hadoop- and Spark-based Euclid ecosystem lets us scale with Uber's growth through a channel-agnostic API plugin architecture called MaRS and a custom ETL pipeline that streams heterogeneous data into a single schema for easy querying. A visualization layer on top of Euclid lets marketers pull ROI metrics to optimize ad spend, and Euclid's pattern recognition capabilities power marketing intelligence further. With these capabilities, marketers can detect fraud, automatically allocate ad budget (by channel, country, city, or product), automate ad spend, manage campaigns, target audience profiles at a granular level, and bid on ads in real time.
Problem Solving with Euclid
Euclid is what it is today because we addressed long-standing issues in the marketing data world:
Automating ad spend.
To do away with collecting and processing data manually, we needed a system that could not only analyze data reliably at high speeds, but also show spending at event-specific granular levels, such as impressions, clicks, and installs, from hundreds of external ad networks all over the world.
Dealing with data complexity.
Data at Uber comes from more than a hundred channels per ad network, each with more than ten thousand campaigns, millions of keywords, and many creatives. Throw in multiple currencies and various Uber product types on top of that. Beyond scale, data variety also plays a part: different kinds of ads (social, display, search, job boards, and so on) result in complex schemas and data mappings. When delivering complex data at such scale and volume, SLAs matter: we have to provide data at promised intervals on an hourly and daily basis.
Reporting accurate ROI.
In the past, teams measured channel-level ROI with time lags. Euclid's challenge lay not just in closing that lag, but also in measuring ROI at a granular level: at the ad, creative, and keyword level, for example. To get there, we needed to join internal and external data sources and apply predictive analytics to report future trends and multi-touch attribution.
The Euclid system has three main parts: MaRS (Marketing Report Service), an email service, and an ETL data pipeline.
MaRS (Marketing Report Service)
A plugin-based report ingestion microservice, MaRS is the reason we can scale fast: up to 80 channels in just a couple of months. Before Euclid, marketing managers manually generated high-level weekly spend data for hundreds of channels around the globe. The data was non-granular and prone to human error, and marketing teams could neither act fast on spending goals nor optimize ad spend accurately.
MaRS solved this problem with a standard API interface that imports ad campaign data via plugins. The plugins isolate ad network logic from the rest of the ETL data pipeline, which lets us develop and test ad-network-dependent logic independently from the Hadoop ETL pipeline. The ETL data pipeline calls the MaRS API, passing a network ID as a parameter, to get various ad spend data without knowing any ad-network-specific logic.
Such a design gives us the advantage of outsourcing plugin development to external vendors, who can add plugins for many more ad networks. It's one of the ways we scaled the number of API channel imports for Euclid. Currently, Euclid supports 30+ plugins, including Facebook, Twitter, and AdWords.
It’s straightforward to implement a plugin. All an external plugin developer needs to do is inherit three base classes, define the Avro schema, and apply a normalization rule:
- Auth class: Handles network API authentication.
- Extractor class: Extracts data through an ad network API.
- Transform class: Converts the API data format into Avro.
- Avro schema: Stores the raw ad network data schema.
- Normalization rule: Applies a mapping rule to convert fields from the raw Avro schema to a uniform model table.
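To make the plugin contract concrete, here is a minimal sketch in Python of what implementing those base classes might look like. The class names `Auth`, `Extractor`, and `Transform` come from the list above; everything else (the `ExampleAds` network, method signatures, and field names) is hypothetical, since the real MaRS interface is internal to Uber:

```python
# Hypothetical base classes a MaRS plugin developer would inherit.
# Method signatures are illustrative, not the actual internal API.
class Auth:
    def credentials(self) -> dict:
        raise NotImplementedError

class Extractor:
    def extract(self, network_id: str, date: str) -> list:
        raise NotImplementedError

class Transform:
    def to_avro(self, raw_rows: list) -> list:
        raise NotImplementedError

# A plugin for an imaginary ad network, "ExampleAds."
class ExampleAdsAuth(Auth):
    """Handles authentication against the fictional ExampleAds API."""
    def credentials(self):
        return {"api_key": "SECRET_FROM_VAULT"}

class ExampleAdsExtractor(Extractor):
    """Pulls daily spend rows from the ExampleAds reporting endpoint."""
    def extract(self, network_id, date):
        # A real plugin would call the network's HTTP API here.
        return [{"campaign": "summer_promo", "clicks": 120,
                 "spend_usd": "42.50", "day": date}]

class ExampleAdsTransform(Transform):
    """Maps raw ExampleAds fields into the plugin's declared Avro schema."""
    def to_avro(self, raw_rows):
        return [{"campaign_name": r["campaign"],
                 "clicks": int(r["clicks"]),
                 "spend": float(r["spend_usd"]),
                 "report_date": r["day"]}
                for r in raw_rows]
```

Because each plugin only touches its own network's quirks, a new network can be onboarded without changes to the shared pipeline code.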
Engineers implement plugins independently, without worrying about underlying changes to the service environment or data infrastructure. The config-based design of the Hadoop Hive pipeline, combined with the plugin-based MaRS API architecture, allows us to add new channel plugins rapidly without making code changes.
Euclid Email Service
For channels that lack APIs, Euclid provides an email attachment-based ingestion framework. Using this system, ad channels can directly send CSV attachments in an expected format. Included in the MaRS architecture, the service allows ad networks to push their daily spend as email attachments to a predefined collection point. Euclid then automatically validates and lands the data into the ETL pipeline. This is how Euclid imports data from dozens of small channels every day.
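The key to this ingestion path is validating each attachment against the expected format before it lands in the pipeline. A rough sketch of that validation step, assuming a hypothetical fixed CSV header (the real column set is internal to Uber):

```python
import csv
import io

# Hypothetical expected header for an emailed spend report.
EXPECTED_COLUMNS = ["report_date", "campaign", "spend", "currency"]

def validate_csv_attachment(payload: str) -> list:
    """Parse a CSV attachment, rejecting files whose schema or values
    don't match expectations, and return the parsed rows."""
    reader = csv.DictReader(io.StringIO(payload))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for row in reader:
        row["spend"] = float(row["spend"])  # fail fast on malformed numbers
        rows.append(row)
    return rows
```

Rejected files trigger a notification back to the sending channel rather than silently corrupting downstream tables.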
ETL Data Pipeline Workflow
- Extractor: Requests MaRS to ingest data from the ad network APIs.
- Normalizer: Normalizes ad network specific raw spend data into a unified Hive table called fact_marketing_spend_ad.
- ROI joiner: Joins spend with conversion data such as user level signups and first trips to get granular ROI metrics.
With MaRS handling the ad network logic, the channel-agnostic ETL pipeline lets us engineer Uber marketing data at scale while minimizing the cost of operating and maintaining the pipeline. Since each channel differs in its campaign and report schema, it's hard to query and analyze the data directly. But the config-based normalization step in Euclid aligns heterogeneous data in various Avro schemas into a single model table in Parquet, which makes the data easy to aggregate, compare, and analyze across dimensions and slices.
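The config-based normalization step can be sketched as a per-network field mapping applied to each raw row. The network names, raw fields, and unified column names below are invented for illustration; only the target table name `fact_marketing_spend_ad` comes from the pipeline description above:

```python
# Hypothetical per-network normalization configs: each maps raw Avro field
# names to unified columns in fact_marketing_spend_ad. Names are illustrative.
NORMALIZATION_RULES = {
    "exampleads": {"campaign_name": "campaign", "spend": "spend_usd", "report_date": "ds"},
    "othernet":   {"cmp": "campaign", "cost": "spend_usd", "day": "ds"},
}

def normalize(network: str, raw_row: dict) -> dict:
    """Rename a raw row's fields into the unified schema.

    Fields without a mapping rule are dropped; adding a new channel
    means adding a config entry, not writing new pipeline code.
    """
    rules = NORMALIZATION_RULES[network]
    return {unified: raw_row[raw] for raw, unified in rules.items()}
```

Because the mapping lives in configuration rather than code, two networks with completely different report schemas land in the same queryable table.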
That said, daily ingestion remains tricky given varying data schemas and readiness SLAs. For example, APIs can break when a version is deprecated, when credentials or the upstream channel data format change, or when upstream data becomes corrupted. To get around this problem, Euclid's custom MySQL-based monitoring and alerting runs a battery of data health checks. Its automated backfill policy, failure logging, and on-call alerting pinpoint the cause of data integrity issues. In this way, the Euclid pipeline reliably ingests hundreds of millions of records into Hadoop daily.
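One family of health checks compares a day's ingested totals against recent history and flags outliers for on-call follow-up. The following is a deliberately simplified sketch of that idea; the metric, the leave-one-out baseline, and the 50% tolerance are all assumptions, not Euclid's actual checks:

```python
# Simplified anomaly check: flag any day whose spend deviates from the
# mean of the other days by more than `tolerance` (as a fraction).
def health_check(daily_spend: list, tolerance: float = 0.5) -> list:
    """Return (index, value, baseline) tuples for anomalous days."""
    alerts = []
    for i, value in enumerate(daily_spend):
        others = daily_spend[:i] + daily_spend[i + 1:]
        baseline = sum(others) / len(others)
        if baseline and abs(value - baseline) / baseline > tolerance:
            alerts.append((i, value, baseline))
    return alerts
```

A sudden drop to near zero, say from a broken credential, would surface immediately rather than silently skewing ROI reports.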
Before and After Euclid Optimized Uber Marketing ROI
Granular ROI Ad Metrics
Besides standard acquisition metrics, the cost of a user making their first Uber trip is an important ad performance ROI metric. Previously, when teams measured ad ROI there were time gaps. Now with Euclid, marketers access granular ROI ad data quickly and reliably. Powered by Euclid's hierarchical campaign spend dataset, marketers drill down deeper: to the creative level for display ads on social channels, to the keyword level for search channels, to the job board level for job board channels, and so forth.
Here are three display ad creatives marketers tested for the same campaign. Which ad do you think performed better?
To see which ad results in a better click-through rate (CTR), marketers drill down to creative-level performance data via Euclid's visualization layer. There they see granular ROI spending data that Euclid ingested and joined with user conversion metrics. In this example, the zoomed-in version, Variant B, with its significantly higher CTR, works better than either the Control version or the zoomed-out creative (Variant A).
Given thousands of ads running across hundreds of channels, many of them maintained by external agencies, it's humanly impossible to check the status of every single ad. For such cases, Euclid lets Uber channel managers easily surface underperforming ads.
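The creative comparison and the underperformer filter both reduce to simple CTR arithmetic. As a toy illustration (the click and impression numbers below are invented, not Uber data, and the 1% threshold is an arbitrary assumption):

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: clicks per impression."""
    return clicks / impressions if impressions else 0.0

# Invented numbers for the three creatives in the example above.
variants = {
    "Control":   {"clicks": 120, "impressions": 10000},
    "Variant A": {"clicks": 95,  "impressions": 10000},
    "Variant B": {"clicks": 310, "impressions": 10000},
}

def underperformers(ads: dict, threshold: float) -> list:
    """Surface ads whose CTR falls below a given threshold."""
    return sorted(name for name, d in ads.items()
                  if ctr(d["clicks"], d["impressions"]) < threshold)

best = max(variants, key=lambda v: ctr(variants[v]["clicks"],
                                       variants[v]["impressions"]))
```

Run over every active ad daily, a filter like `underperformers` turns an impossible manual review into a short, actionable list.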
Predicting First Trips
The conversion journey from Uber app install to first trip usually involves time gaps, and if we don't know when a first trip occurs, we can't attribute a particular marketing activity as the motivating trigger. Yesterday's ad spend could result in clicks or installs without yet driving signups or first trips. How, then, can we measure marketing ROI? After Euclid ingests data from impressions, clicks, and installs, it applies patterns of where and when partners and riders sign up, the device types they use, the ad channels they engage with, and so on. Mapping this information to historical conversion data, Euclid statistically predicts when, and whether, a user is likely to make their first trip. As Euclid accumulates more data into its training models, prediction accuracy increases.
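The core idea, estimating conversion likelihood from historical signals like device type and ad channel, can be sketched with a simple frequency-based predictor. This is a bare-bones stand-in for illustration only; Euclid's production models are far richer:

```python
from collections import defaultdict

class FirstTripPredictor:
    """Toy predictor: historical first-trip rate per (device, channel) bucket.

    A stand-in for a real trained model; it only illustrates mapping
    signup signals to a conversion probability.
    """
    def __init__(self):
        # (device, channel) -> [first_trips, signups]
        self.counts = defaultdict(lambda: [0, 0])

    def observe(self, device: str, channel: str, took_first_trip: bool):
        """Fold one historical signup outcome into the model."""
        trips, signups = self.counts[(device, channel)]
        self.counts[(device, channel)] = [trips + took_first_trip, signups + 1]

    def predict(self, device: str, channel: str) -> float:
        """Estimated probability that a new signup makes a first trip."""
        trips, signups = self.counts[(device, channel)]
        return trips / signups if signups else 0.0
```

As with Euclid itself, the more historical outcomes the model observes, the more stable its estimates become.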
Solving Multi-touch Attribution
When a user sees a social network ad, then searches “Uber” on the web, clicks a search ad, and later signs up with Uber, it’s not fair to give all the credit to the search ad. Multi-touch attribution, another challenge Euclid tackles, involves looking at the impression level data, analyzing the complete user conversion journey, and then attributing the right weight of credit for each conversion to multiple channels. Clear multi-touch attribution allows marketers to allocate the right budget to the right channels. Euclid ingests impression level data every day, then analyzes and trains multi-touch attribution models using its predictive engine. Powered by Hadoop and Spark, Euclid soon plans to deploy marketing multi-touch attribution models in the production data pipeline.
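The simplest baseline for the weighting problem described above is linear attribution: split each conversion's credit evenly across every channel the user touched. Euclid trains data-driven weights instead, but the equal-split version below shows the shape of the computation:

```python
from collections import Counter

def attribute(journeys: list) -> Counter:
    """Linear multi-touch attribution.

    `journeys` is a list of converting user journeys, each a list of
    touched channel names. Each conversion contributes one unit of
    credit, split evenly across its touchpoints. (Equal weights are the
    simplest baseline; a learned model would assign unequal weights.)
    """
    credit = Counter()
    for touchpoints in journeys:
        weight = 1.0 / len(touchpoints)
        for channel in touchpoints:
            credit[channel] += weight
    return credit
```

In the example above, a `["social", "search"]` journey gives the search ad only half the credit, instead of all of it as last-touch attribution would.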
What we’ve developed so far in the Euclid tech stack has helped marketers understand which creatives, search keywords, and ad campaigns work best for a particular channel. We’re looking to make the next incarnation of Euclid an even more advanced marketing engine involving data management platforms (DMP) and demand-side platforms (DSP). So if you have experience with Hadoop, Hive, Spark, Kafka, Vertica, or an understanding of SQL databases to implement ETL pipelines, check out the engineering talent we’re looking to hire to further Euclid’s development. Come be part of our story!
Hohyun Shim, software engineer in Uber Engineering’s business platform group who leads Euclid’s architecture, wrote this article in conjunction with Mrina Natarajan.