# Mengye Ren

## Engineering Blog Articles

### SBNet: Leveraging Activation Block Sparsity for Speeding up Convolutional Neural Networks

Uber ATG Toronto developed Sparse Blocks Network (SBNet), an open source algorithm for TensorFlow, to speed up inference of our 3D vehicle detection systems while lowering computational costs.

## Research Papers

### Graph HyperNetworks for Neural Architecture Search

**C. Zhang**, **M. Ren**, **R. Urtasun**

Neural architecture search (NAS) automatically finds the best task-specific neural network topology, outperforming many manual architecture designs. However, it can be prohibitively expensive as the search requires training thousands of different networks, while each can last for hours. In this work, we propose the Graph HyperNetwork (GHN) to amortize the search cost: given an architecture, it directly generates the weights by running inference on a graph neural network. [...]
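
As a rough sketch of this idea, the toy NumPy example below propagates node embeddings over a small architecture graph and decodes each node's embedding directly into that node's weight tensor. All sizes, the single round of message passing, and the random parameters are invented for illustration; the actual GHN is a trained graph neural network generating full parameter tensors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy architecture graph: 4 ops in a chain, adjacency on the subdiagonal.
num_nodes, embed_dim, weight_size = 4, 8, 3 * 3  # e.g. flattened 3x3 kernels
A = np.eye(num_nodes, k=-1)                      # edge (i-1) -> i

# Pieces of the hypernetwork (random stand-ins for learned parameters).
node_embed = rng.normal(size=(num_nodes, embed_dim))
W_msg = rng.normal(size=(embed_dim, embed_dim)) * 0.1
W_out = rng.normal(size=(embed_dim, weight_size)) * 0.1

# One round of message passing over the architecture graph.
h = np.tanh(node_embed + A @ node_embed @ W_msg)

# Decode each node embedding into that op's weights: one inference pass,
# no per-architecture training.
generated_weights = h @ W_out
print(generated_weights.shape)  # (4, 9)
```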

**[PDF]**

*Meta Learning workshop @ Neural Information Processing Systems (NeurIPS), 2018*

### Incremental Few-Shot Learning with Attention Attractor Networks

**M. Ren**, **R. Liao**, E. Fetaya, R. Zemel

This paper addresses the problem of incremental few-shot learning, in which a regular classification network has already been trained to recognize a set of base classes, and several extra novel classes are being considered, each with only a few labeled examples. [...]
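
The setup can be sketched in NumPy as below. This only illustrates the base-plus-novel joint classification setting, with novel-class weights taken as class prototypes; the paper's attention attractor regularizer is omitted, and all sizes are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, num_base, num_novel, shots = 16, 5, 2, 3

# Pretrained base classifier: one weight vector per base class (frozen).
W_base = rng.normal(size=(num_base, dim))

# A few labeled examples per novel class -> prototype (class-mean) weights.
support = rng.normal(size=(num_novel, shots, dim))
W_novel = support.mean(axis=1)                       # (num_novel, dim)

# Joint classification over base + novel classes together.
W_joint = np.concatenate([W_base, W_novel], axis=0)  # (7, dim)
query = rng.normal(size=(4, dim))
logits = query @ W_joint.T
print(logits.shape)  # (4, 7): scores over all 7 classes
```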

**[PDF]**

*Meta Learning workshop @ Neural Information Processing Systems (NeurIPS), 2018*

### Learning to Reweight Examples for Robust Deep Learning

**M. Ren**, **W. Zeng**, **B. Yang**, **R. Urtasun**

Deep neural networks have been shown to be very powerful modeling tools for many supervised learning tasks involving complex input patterns. However, they can also easily overfit to training set biases and label noises. [...]
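
A minimal sketch of the reweighting idea, in the spirit of the paper's online approximation: each training example's weight comes from the rectified alignment between its gradient and the gradient on a small clean validation set. This toy uses one step of linear regression with invented data, not the paper's actual algorithm or networks.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 32, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true
y[:8] += 5.0                                   # corrupt some training labels
Xv = rng.normal(size=(8, d))
yv = Xv @ w_true                               # small clean validation set

w = np.zeros(d)                                # current model parameters

# Per-example squared-loss gradients at w, and the clean validation gradient.
g_train = 2 * (X @ w - y)[:, None] * X         # (n, d)
g_val = (2 * (Xv @ w - yv)[:, None] * Xv).mean(axis=0)

# Example weight = rectified alignment with the validation gradient,
# normalized to sum to one; noisy examples tend to get down-weighted.
eps = np.maximum(0.0, g_train @ g_val)
weights = eps / (eps.sum() + 1e-12)

# One reweighted gradient step.
w = w - 0.1 * (weights[:, None] * g_train).sum(axis=0)
print(weights.shape)  # (32,)
```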

**[PDF]**

*International Conference on Machine Learning (ICML), 2018*

### SBNet: Sparse Blocks Network for Fast Inference

**M. Ren**, **A. Pokrovsky**, **B. Yang**, **R. Urtasun**

Conventional deep convolutional neural networks (CNNs) apply convolution operators uniformly in space across all feature maps for hundreds of layers, which incurs a high computational cost for real-time applications. For many problems such as object detection and semantic segmentation, we are able to obtain a low-cost computation mask, either from a priori problem knowledge, or from a low-resolution segmentation network. [...]
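
The gather/compute/scatter pattern at the heart of SBNet can be sketched as below: stack only the mask-active blocks, run the expensive op on that small batch, and scatter results back. A cheap element-wise op stands in for the real block convolution, and all sizes and the mask are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
H = W = 8; B = 4                        # feature map size and block size
x = rng.normal(size=(H, W))
mask = np.zeros((H // B, W // B), dtype=bool)
mask[0, 1] = mask[1, 0] = True          # only two of four blocks are active

# Gather: stack only the active BxB blocks into a small batch.
idx = np.argwhere(mask)
blocks = np.stack([x[i*B:(i+1)*B, j*B:(j+1)*B] for i, j in idx])

# Compute only on the gathered blocks (a stand-in for convolution).
blocks = np.maximum(blocks, 0.0)

# Scatter: write results back to their original block locations.
out = np.zeros_like(x)
for (i, j), b in zip(idx, blocks):
    out[i*B:(i+1)*B, j*B:(j+1)*B] = b
print(out.shape)  # (8, 8)
```

The speedup comes from the compute step scaling with the number of active blocks rather than the full spatial extent.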

**[PDF]**

*Conference on Computer Vision and Pattern Recognition (CVPR), 2018*

### Understanding Short-Horizon Bias in Stochastic Meta-Optimization

Y. Wu, **M. Ren**, **R. Liao**, R. Grosse

Careful tuning of the learning rate, or even schedules thereof, can be crucial to effective neural net training. There has been much recent interest in gradient-based meta-optimization, where one tunes hyperparameters, or even learns an optimizer, in order to minimize the expected loss when the training procedure is unrolled. [...]
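
The bias can be demonstrated on a toy problem: on a noisy quadratic (invented here for illustration, not from the paper), the learning rate that minimizes loss after a short unrolled horizon is much larger than the one that is best over a long horizon, because greedy meta-optimization ignores the noise that accumulates later.

```python
import numpy as np

rng = np.random.default_rng(4)

def meta_loss(lr, K, trials=200):
    """Average final loss of K unrolled SGD steps on a noisy quadratic."""
    w = np.full(trials, 5.0)
    for _ in range(K):
        g = w + rng.normal(scale=2.0, size=trials)  # noisy gradient of 0.5*w^2
        w = w - lr * g
    return float(np.mean(0.5 * w**2))

lrs = np.linspace(0.05, 0.95, 10)
best_short = lrs[np.argmin([meta_loss(lr, K=1) for lr in lrs])]
best_long = lrs[np.argmin([meta_loss(lr, K=100) for lr in lrs])]
# The short horizon prefers an aggressively large step; the long horizon
# prefers a much smaller one -- the short-horizon bias.
print(best_short, best_long)
```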

**[PDF]**

*International Conference on Learning Representations (ICLR), 2018*

### Meta-Learning for Semi-Supervised Few-Shot Classification

**M. Ren**, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. Tenenbaum, H. Larochelle, R. Zemel

In few-shot classification, we are interested in learning algorithms that train a classifier from only a handful of labeled examples. Recent progress in few-shot classification has featured meta-learning, in which a parameterized model for a learning algorithm is defined and trained on episodes representing different classification problems, each with a small labeled training set and its corresponding test set. [...]
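
One of the paper's proposed extensions refines class prototypes with unlabeled data via soft k-means; a toy NumPy sketch of that refinement step (all sizes and data invented, random features standing in for a learned embedding) looks like:

```python
import numpy as np

rng = np.random.default_rng(5)
dim, classes, shots, unlabeled = 8, 3, 2, 12

support = rng.normal(size=(classes, shots, dim))   # labeled support set
protos = support.mean(axis=1)                      # initial prototypes
U = rng.normal(size=(unlabeled, dim))              # unlabeled examples

# Soft-assign each unlabeled point to prototypes by distance...
d2 = ((U[:, None, :] - protos[None]) ** 2).sum(-1)  # (unlabeled, classes)
soft = np.exp(-d2)
soft /= soft.sum(axis=1, keepdims=True)

# ...then recompute each prototype as a weighted mean of labeled
# and soft-assigned unlabeled points.
num = support.sum(axis=1) + soft.T @ U             # (classes, dim)
den = shots + soft.sum(axis=0)[:, None]
protos_refined = num / den
print(protos_refined.shape)  # (3, 8)
```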

**[PDF]**

Code & Datasets:

**[LINK]**

*International Conference on Learning Representations (ICLR), 2018*

### The Reversible Residual Network: Backpropagation Without Storing Activations

A. Gomez, **M. Ren**, **R. Urtasun**, R. Grosse

Residual Networks (ResNets) have demonstrated significant improvement over traditional Convolutional Neural Networks (CNNs) on image classification, increasing in performance as networks grow both deeper and wider. However, memory consumption becomes a bottleneck as one needs to store all the intermediate activations for calculating gradients using backpropagation. [...]
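
The reversible block's additive coupling makes inputs exactly recoverable from outputs, so activations need not be stored. A minimal NumPy sketch (tiny random functions stand in for the residual subnetworks):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
W_f = rng.normal(size=(d, d))
W_g = rng.normal(size=(d, d))
F = lambda v: np.tanh(v @ W_f)   # stand-ins for residual subnetworks
G = lambda v: np.tanh(v @ W_g)

# Forward: split the channels in two, couple the halves additively.
x1, x2 = rng.normal(size=d), rng.normal(size=d)
y1 = x1 + F(x2)
y2 = x2 + G(y1)

# Inverse: reconstruct the inputs from the outputs alone, so intermediate
# activations can be recomputed during backprop instead of stored.
x2_rec = y2 - G(y1)
x1_rec = y1 - F(x2_rec)
print(np.allclose(x1, x1_rec), np.allclose(x2, x2_rec))  # True True
```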

**[PDF]**

*Advances in Neural Information Processing Systems (NeurIPS), 2017*

### End-To-End Instance Segmentation With Recurrent Attention

**M. Ren**, R. Zemel

While convolutional neural networks have gained impressive success recently in solving structured prediction problems such as semantic segmentation, it remains a challenge to differentiate individual object instances in the scene. Instance segmentation is very important in a variety of applications, such as autonomous driving, image captioning, and visual question answering. [...]

**[PDF]**

Supplementary Materials:

**[LINK]**

Code:

**[LINK]**

*Conference on Computer Vision and Pattern Recognition (CVPR), 2017*

### Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes

**M. Ren**, **R. Liao**, **R. Urtasun**, F. H. Sinz, R. Zemel

Normalization techniques have only recently begun to be exploited in supervised learning tasks. Batch normalization exploits mini-batch statistics to normalize the activations. This was shown to speed up training and result in better models. However, its success has been very limited when dealing with recurrent neural networks. On the other hand, layer normalization normalizes the activations across all activities within a layer. This was shown to work well in the recurrent setting. In this paper we propose a unified view of normalization techniques, as forms of divisive normalization, which includes layer and batch normalization as special cases. [...]
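
The unified view can be sketched as one divisive operation applied over different axes, with batch and layer normalization as two special cases (a simplified illustration; the paper's formulation also covers local neighborhoods and a smoothing constant, here just `sigma`):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=(16, 10))        # (batch, features)

def divisive_norm(x, axis, sigma=1e-5):
    """Subtract a mean and divide by a smoothed std over the given axis."""
    mu = x.mean(axis=axis, keepdims=True)
    var = ((x - mu) ** 2).mean(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + sigma)

bn = divisive_norm(x, axis=0)        # batch norm: statistics over the batch
ln = divisive_norm(x, axis=1)        # layer norm: statistics over features
print(bn.shape, ln.shape)  # (16, 10) (16, 10)
```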

**[PDF]**

*International Conference on Learning Representations (ICLR), 2017*