
Automatic Discovery of the Statistical Types of Variables in a Dataset

Abstract

A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, are known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.

Authors

Isabel Valera, Zoubin Ghahramani

Conference

ICML 2017

Full Paper

‘Automatic Discovery of the Statistical Types of Variables in a Dataset’ (PDF)

Uber AI

Zoubin Ghahramani is Chief Scientist of Uber and a world leader in the field of machine learning, significantly advancing the state of the art in algorithms that can learn from data. He is known in particular for fundamental contributions to probabilistic modeling and Bayesian approaches to machine learning systems and AI. Zoubin also maintains his roles as Professor of Information Engineering at the University of Cambridge and Deputy Director of the Leverhulme Centre for the Future of Intelligence. He was one of the founding directors of the Alan Turing Institute (the UK's national institute for Data Science and AI), and is a Fellow of St John's College, Cambridge, and of the Royal Society.