TorontoCity: Seeing the World With a Million Eyes

    Abstract

    In this paper we introduce the TorontoCity benchmark, which covers the full greater Toronto area (GTA) with 712.5 km² of land, 8,439 km of road, and around 400,000 buildings. Our benchmark provides different perspectives of the world captured from airplanes, drones, and cars driving around the city. Manually labeling such a large-scale dataset is infeasible. Instead, we propose to utilize maps as ground truth and develop a pipeline that aligns them with the captured imagery. Our benchmark provides a variety of tasks, including building height estimation (reconstruction), road centerline and curb extraction, building instance segmentation, building contour extraction (reorganization), semantic labeling, and scene type classification (recognition). Our first experiments show that existing state-of-the-art algorithms fall short on these tasks, demonstrating that TorontoCity is a challenging benchmark for large-scale scene understanding.

    Authors

    Shenlong Wang, Min Bai, Gellert Mattyus, Hang Chu, Wenjie Luo, Bin Yang, Justin Liang, Joel Cheverie, Sanja Fidler, Raquel Urtasun

    Conference

    ICCV 2017

    Full Paper

    ‘TorontoCity: Seeing the World With a Million Eyes’ (PDF)

    Uber ATG

    Shenlong Wang
    Shenlong Wang is a research scientist at Uber ATG Toronto working on the development of self-driving cars. He is also a PhD student at the University of Toronto, advised by Prof. Raquel Urtasun. He has broad interests in computer vision, machine learning, and robotics, and is particularly interested in 3D vision and deep structured models.
    Min Bai
    Min Bai is a research scientist at Uber ATG Toronto. Before that, he was a wireless systems engineer at Apple. He holds an undergraduate degree in electrical engineering from the University of Waterloo. His research interests include various perception tasks such as segmentation, point cloud processing, and online mapping.
    Gellert Mattyus
    Gellert Mattyus is a research scientist at Uber ATG Toronto working on computer vision and machine learning problems related to self-driving, with an emphasis on perceiving maps. He earned his PhD at the Remote Sensing Technology Chair of the Technical University of Munich (TUM) while working as a research scientist at the Photogrammetry and Image Analysis Department of the German Aerospace Center (DLR). After earning his PhD, he spent nearly a year as a postdoc in the Machine Learning Group at the University of Toronto under the supervision of Professor Raquel Urtasun.
    Wenjie Luo
    Wenjie is a senior research scientist and founding member of the Uber ATG R&D team. His research interests include computer vision and machine learning, and his work spans the full autonomy stack, including perception, prediction, and planning. Previously, he earned a master's degree at TTI-Chicago and continued to the PhD program at the University of Toronto, both under Prof. Raquel Urtasun. He also spent some time at Apple SPG prior to joining Uber.
    Bin Yang
    Bin Yang is a research scientist at Uber ATG Toronto. He is also a PhD student at the University of Toronto, supervised by Prof. Raquel Urtasun. His research interests lie in computer vision and deep learning, with a focus on 3D perception in autonomous driving scenarios.
    Justin Liang
    Justin Liang is a research scientist at Uber ATG Toronto. His research focuses on computer vision and machine learning for mapping and detection in self-driving vehicles. Before joining ATG, he completed an MSc in Computer Science at the University of Toronto, supervised by Raquel Urtasun. He also holds a BASc in Mechanical Engineering from the University of British Columbia.
    Raquel Urtasun
    Raquel Urtasun is the Chief Scientist for Uber ATG and the Head of Uber ATG Toronto. She is also a Professor at the University of Toronto, a Canada Research Chair in Machine Learning and Computer Vision, and a co-founder of the Vector Institute for AI. She is a recipient of an NSERC EWR Steacie Award, an NVIDIA Pioneers of AI Award, a Ministry of Education and Innovation Early Researcher Award, three Google Faculty Research Awards, an Amazon Faculty Research Award, a Connaught New Researcher Award, a Fallona Family Research Award, and two Best Paper Runner-Up Prizes awarded at CVPR in 2013 and 2017. She was also named Chatelaine's 2018 Woman of the Year and one of Toronto's top influencers of 2018 by Adweek magazine.