Synthesizing Entity Matching Rules by Examples

Abstract

Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive-negative matching examples. The core of our solution is program synthesis, a powerful tool to automatically generate rules (or programs) that satisfy a given high-level specification, via a predefined grammar. This grammar describes a General Boolean Formula (GBF) that can include arbitrary attribute matching predicates combined by conjunctions (∧), disjunctions (∨) and negations (¬), and is expressive enough to model EM problems, from capturing arbitrary attribute combinations to handling missing attribute values. The rules in the form of GBF are more concise than traditional EM rules represented in Disjunctive Normal Form (DNF). Consequently, they are more interpretable than decision trees and other machine learning algorithms that output deep trees with many branches. We present a new synthesis algorithm that, given only positive-negative examples as input, synthesizes EM rules that are effective over the entire dataset. Extensive experiments show that we outperform other interpretable rules (e.g., decision trees with low depth) in effectiveness, and are comparable with non-interpretable tools (e.g., decision trees with high depth, gradient-boosting trees, random forests and SVM).
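To make the grammar concrete, here is a minimal sketch of how a GBF rule might be represented and evaluated on a pair of records. It illustrates only the rule form (attribute predicates combined by conjunction, disjunction and negation, plus a check for missing values), not the synthesis algorithm from the paper. The class names (`Pred`, `Missing`, `Not`, `And`, `Or`), the similarity functions, the 0.8 threshold, and the sample records are all assumptions made for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional, Union

# A record maps attribute names to values; None marks a missing value.
Record = dict


def exact(a: str, b: str) -> float:
    """Similarity 1.0 for identical strings, else 0.0."""
    return 1.0 if a == b else 0.0


def ratio(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


@dataclass(frozen=True)
class Pred:
    """Attribute matching predicate: sim(r1[attr], r2[attr]) >= threshold."""
    attr: str
    sim: Callable[[str, str], float]
    threshold: float


@dataclass(frozen=True)
class Missing:
    """True when the attribute is missing in at least one record."""
    attr: str


@dataclass(frozen=True)
class Not:
    arg: "Rule"


@dataclass(frozen=True)
class And:
    left: "Rule"
    right: "Rule"


@dataclass(frozen=True)
class Or:
    left: "Rule"
    right: "Rule"


Rule = Union[Pred, Missing, Not, And, Or]


def matches(rule: Rule, r1: Record, r2: Record) -> bool:
    """Evaluate a GBF-style rule on a pair of records."""
    if isinstance(rule, Pred):
        a: Optional[str] = r1.get(rule.attr)
        b: Optional[str] = r2.get(rule.attr)
        if a is None or b is None:
            return False  # assumption: a predicate fails on missing values
        return rule.sim(a, b) >= rule.threshold
    if isinstance(rule, Missing):
        return r1.get(rule.attr) is None or r2.get(rule.attr) is None
    if isinstance(rule, Not):
        return not matches(rule.arg, r1, r2)
    if isinstance(rule, And):
        return matches(rule.left, r1, r2) and matches(rule.right, r1, r2)
    if isinstance(rule, Or):
        return matches(rule.left, r1, r2) or matches(rule.right, r1, r2)
    raise TypeError(f"unknown rule node: {rule!r}")


# Hypothetical rule: names are similar AND (phones are equal OR a phone is missing).
rule = And(
    Pred("name", ratio, 0.8),
    Or(Pred("phone", exact, 1.0), Missing("phone")),
)

a = {"name": "Jon Smith", "phone": "555-1234"}
b = {"name": "John Smith", "phone": None}
c = {"name": "Jon Smith", "phone": "555-9999"}

print(matches(rule, a, b))  # True: similar names, one phone missing
print(matches(rule, a, c))  # False: both phones present but different
```

Written in DNF, the same hypothetical rule would need two conjunctive clauses that each repeat the name predicate; the nested form states it once, which is the kind of conciseness the abstract attributes to GBF rules.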

Authors

Rohit Singh, Venkata Vamsikrishna Meduri, Ahmed K. Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Armando Solar-Lezama, Nan Tang

Publication

PVLDB

Full Paper

‘Synthesizing Entity Matching Rules by Examples’ (PDF)

Uber AI

Rohit Singh
Rohit is an AI researcher with a PhD from the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He is currently working on applications of various AI techniques with the Pyro programming language across product teams at Uber. His previous work has involved applications of machine learning, quantitative game theory and program synthesis in multiple domains from the fields of compilers and databases. Rohit has worked as an intern at Google, where he used the Google Brain deep-learning framework for an application with the YouTube team, and as a PM intern at Yelp, where he worked on a machine learning application for ad CTR prediction.