Automatic Discovery of the Statistical Types of Variables in a Dataset


    A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, is known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.


    Isabel Valera, Zoubin Ghahramani


    ICML 2017

    Full Paper

    ‘Automatic Discovery of the Statistical Types of Variables in a Dataset’ (PDF)

    Uber AI

    Previous articleEnd-To-End Instance Segmentation With Recurrent Attention
    Next articleA birth-death process for feature allocation
    Zoubin Ghahramani
    Zoubin Ghahramani is Chief Scientist of Uber and a world leader in the field of machine learning, significantly advancing the state-of-the-art in algorithms that can learn from data. He is known in particular for fundamental contributions to probabilistic modeling and Bayesian approaches to machine learning systems and AI. Zoubin also maintains his roles as Professor of Information Engineering at the University of Cambridge and Deputy Director of the Leverhulme Centre for the Future of Intelligence. He was one of the founding directors of the Alan Turing Institute (the UK's national institute for Data Science and AI), and is a Fellow of St John's College Cambridge and of the Royal Society.