By Chunyuan Li, Rosanne Liu, Jason Yosinski
Neural networks have revolutionized machine learning over the last decade, rising from a relatively obscure academic research area to a mainstay of industry powering a myriad of applications wherever large volumes of data are available. Uber uses neural networks for a diverse set of applications, from building maps of the world based on computer vision models, to enabling faster customer response using natural language understanding, to minimizing wait times by modeling spatiotemporal rider demand patterns across cities.
In many cases, the most successful neural networks employ large numbers of parameters—from a few million to hundreds of millions or more—to achieve optimal performance. The exciting news is that these huge networks often work very well, achieving impressive performance on whatever task they are given. But such models are fundamentally complex systems, defying easy understanding because of the large number of parameters that are learned without human intervention. Nonetheless, efforts at understanding network behavior persist, because as networks increasingly impact society it becomes increasingly important to understand their operation, and because a better understanding of network mechanisms and properties will hasten the construction of the next generation of models.
In our paper, Measuring the Intrinsic Dimension of Objective Landscapes, to be presented at ICLR 2018, we contribute to this ongoing effort by developing a simple way of measuring a fundamental network property known as intrinsic dimension. In the paper, we develop intrinsic dimension as a quantification of the complexity of a model in a manner decoupled from its raw parameter count, and we provide a simple way of measuring this dimension using random projections. We find that many problems have smaller intrinsic dimension than one might suspect. By using intrinsic dimension to compare across problem domains, we measure, for example, that solving the inverted pendulum problem is about 100 times easier than classifying digits from MNIST, and playing Atari Pong from pixels is about as hard as classifying CIFAR-10.
Our approach and a few interesting findings are summarized in the video below:
In typical neural network training, we randomly instantiate a network’s parameters and then optimize them to minimize some loss. We can think of this strategy as choosing an initial point in parameter space and then slowly following gradients from that point to one with a lower loss. In this paper, we modify this typical training procedure slightly: we choose the initial point in the same way, but then instead of optimizing in the full parameter space, we generate a random subspace around the initial point and allow the optimizer to move only in that subspace. We construct the random subspace by sampling a set of random directions from the initial point; these random directions are then frozen for the duration of training. Optimization proceeds directly in the coordinate system of the subspace. Computationally, this requires projection from the subspace to the native space, and this projection can be achieved in a few different ways, with some simpler (projection through a dense random matrix), and others more complex but more computationally efficient (projection through carefully constructed random sparse matrices or via the Fastfood transform).
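To make the mechanics concrete, here is a minimal sketch of subspace training via a dense random projection matrix. It is an illustrative toy, not the paper's setup: the sizes, the learning rate, and the quadratic loss standing in for a network's training loss are all our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 1000, 10                       # toy native and subspace dimensions
theta_0 = rng.standard_normal(D)                # random initial point, frozen
P = rng.standard_normal((D, d)) / np.sqrt(D)    # frozen random basis, roughly unit columns
target = rng.standard_normal(D)

def loss_and_grad(theta):
    # Toy quadratic loss standing in for a network's training loss
    diff = theta - target
    return 0.5 * diff @ diff, diff

theta_sub = np.zeros(d)               # optimization happens only in these d coordinates
initial_loss, _ = loss_and_grad(theta_0 + P @ theta_sub)
lr = 0.1
for _ in range(200):
    theta = theta_0 + P @ theta_sub             # project up to the native space
    loss, grad = loss_and_grad(theta)
    theta_sub -= lr * (P.T @ grad)              # chain rule: gradient w.r.t. subspace coords
final_loss, _ = loss_and_grad(theta_0 + P @ theta_sub)
```

Note that the optimizer only ever sees the `d` subspace coordinates; the native parameters are reconstructed on each step by the fixed projection.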
How well will this random projection training work? Well, it depends entirely on the size of the random subspace. On one extreme, using a one-dimensional random subspace corresponds to line search in a random direction, and using such a search will likely never lead to a working solution. On the other extreme, if we use enough random vectors to span the entire space, it is guaranteed that any available solution from the original space will be accessible. The key idea of this paper is to carefully increase the dimension of the subspace until the first solutions are found. As we grow the space, we generally observe a transition from networks that do not work to ones that do, and we call the dimension at which this transition occurs the intrinsic dimension.
In the following sections, we use this approach to measure the intrinsic dimension of networks used to solve MNIST, random MNIST variants, CIFAR-10, and three reinforcement learning (RL) tasks, gleaning some interesting conclusions along the way.
Measuring intrinsic dimension using MNIST
First, we train a three-layer, fully connected (FC) network with layer sizes 784-200-200-10 to classify MNIST. This network will have a full parameter space of about 200,000 dimensions. If we measure its intrinsic dimension (see paper for more details), we get only about 750.
One salient conclusion is that 750 is rather small. In other words, using only 750 degrees of freedom, or 0.4 percent of the full parameter space, we can reach a fairly good solution. The number of degrees of freedom required (750) is much smaller than the number of parameters that even a ten-class linear classifier would entail (7840). In fact, 750 is even smaller than the number of input pixels (784), which for MNIST makes at least some sense, as many of the edge pixels are black for every example in the dataset.
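The arithmetic behind these comparisons can be checked directly (layer sizes are from the text; the computation itself is just a sanity check of ours):

```python
# Parameter count of the 784-200-200-10 fully connected network
layers = [784, 200, 200, 10]
params = sum(n_in * n_out + n_out            # weights plus biases per layer
             for n_in, n_out in zip(layers, layers[1:]))
print(params)                                # 199210

frac = 750 / params                          # fraction of the full space used
linear = 784 * 10                            # weights in a 10-class linear classifier
```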
But let us think carefully about what this number 750 means. Imagine that in three dimensions, we choose a random one-dimensional line and manage to run into a solution. It turns out that in this example, the solution we run into must have at least two dimensions (more precisely: the solution manifold almost surely has at least two dimensions).
So, in the case of our MNIST network, if in 200,000 dimensions we run into a solution by using a 750 dimensional subspace, it tells us the dimension of this “blob” of solutions we have found must be at least 199,250. In other words, this solution we have found is big—it has a lot of redundancy, or it is a manifold with many directions in which one can travel and still remain on the solution set. Cool, right?
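The bound follows from a generic intersection rule, computed here with the rounded figures from the text:

```python
# A random d-dimensional subspace in R^D generically intersects a set of
# dimension s only if d + s >= D, so the solution set has dimension s >= D - d.
D_native, d_intrinsic = 200_000, 750
print(D_native - d_intrinsic)   # 199250
```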
MNIST, wider or taller
Next, we try making our network a little wider, say using hidden layer sizes with width 225 instead of 200. Measuring this network reveals something surprising: the intrinsic dimension is more or less still the same: 750 (see paper for precise measurements). In fact, we still find about 750 if we make the layers much wider, or if we add more layers. Thus, it would seem this quantity we are measuring is a fairly stable metric across a family of models (or at least across this family for this dataset).
But what does the stability of this metric tell us?
This is one of the coolest parts: adding more parameters to the network reveals additional characteristics of the objective landscape. Since the intrinsic dimension is not changing, it means every new parameter we add—every extra dimension we add to the native parameter space—just goes directly into the solution set. The extra parameter merely serves to increase the redundancy of the solution set by one dimension.
Of course, this phenomenon may not always hold for every family of networks and for every dataset. But for this dataset and network family, we discovered that we are on a plateau in hyperparameter space with constant intrinsic dimension. In the paper, we show that this plateau is fairly wide, persisting when layers are widened or new layers added.
Many researchers have noticed that larger networks can be easier to train, and this result suggests a possible reason why: as we add parameters, we increase how much the solution set covers the space. There are definitely some subtleties here, so do check out the paper for caveats.
We can also consider these results from the perspective of compressibility. To checkpoint and restore one of these models, we only need to store a single seed for the random subspace creation and then the 750 learned parameters. This means the model can be compressed from its native parameter space of about 200,000 dimensions by a factor of more than 250. This compression method is not necessarily the most efficient out there, but it is quite simple: any end-to-end parameterized model can just be trained end-to-end in a random subspace instead.
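A minimal sketch of this seed-plus-coordinates checkpointing, where the toy sizes and the `materialize` helper are illustrative choices of ours:

```python
import numpy as np

D, d = 1000, 10        # toy stand-ins for the 200,000 native and 750 subspace dims

def materialize(seed, D, d):
    # Deterministically regenerate the frozen init and random basis from the seed
    rng = np.random.default_rng(seed)
    theta_0 = rng.standard_normal(D)
    P = rng.standard_normal((D, d)) / np.sqrt(D)
    return theta_0, P

seed = 42
theta_0, P = materialize(seed, D, d)
theta_sub = np.arange(d, dtype=float)   # stand-in for the d learned coordinates
checkpoint = (seed, theta_sub)          # the entire saved state: one seed + d numbers

# Restore elsewhere: rebuild theta_0 and P from the seed, then project up
theta_0_r, P_r = materialize(checkpoint[0], D, d)
theta_restored = theta_0_r + P_r @ checkpoint[1]
```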
We can also think of this approach as a way of upper-bounding the minimum description length (MDL) of the model. If you are familiar with Occam’s razor, which states that the simplest explanation is usually the best one, MDL is similar: all else being equal, models with smaller MDL are preferred.
Fully connected (FC) vs. convolutional neural networks (CNNs)
If we use the same dataset—MNIST—but model it with a convolutional network, we measure a much smaller intrinsic dimension of 290 (see Figure 1). Using our MDL argument, this smaller number suggests that CNNs are better for classifying MNIST than FC networks. If you have worked with neural networks for a while, this observation will probably not come as a big enough surprise to drop everything and immediately tweet this article. (But who are we to stop you?)
Indeed, everyone already appreciates that CNNs are great models for image classification. But are they always more efficient than FC networks? What if we trained on a version of MNIST with shuffled input pixels, as done in Zhang et al. (ICLR 2017)? In this case, the permutation-invariant FC network has the same intrinsic dimension of 750 as always, but the CNN now measures in at 1,400!
So, CNNs are better until the assumption of local structure is broken, after which they are measurably worse on the task. These few conclusions might be intuitive to seasoned practitioners, but the point is that we have been able to obtain them by measurement rather than by intuition alone. In a world where we train larger and larger models, consisting of many sub-models and trained using many losses, designing architectures for sub-models may be far less obvious. And in these cases, the ability to rely on careful quantification rather than overextended intuition may prove very useful.
Random label memorization
What if we measure networks trained to memorize randomly-shuffled labels, as done in Zhang et al. (ICLR 2017)?
It turns out that a network trained to memorize 5,000 random labels on MNIST inputs requires an intrinsic dimension of 90,000, which is much larger than we have observed above. This much higher value reflects the larger capacity the network needs to memorize each random label, in this case working out to 18 degrees of freedom for each memorized label.
If we train our model to memorize 50,000 random labels (10 times more), the intrinsic dimension grows, but actually not by that much: now it measures at 190,000. This is larger, but not that much larger, requiring only 3.8 degrees of freedom for each memorized label.
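The per-label arithmetic, using the figures above:

```python
# labels memorized -> measured intrinsic dimension
cases = {5_000: 90_000, 50_000: 190_000}
dof_per_label = {n: dim / n for n, dim in cases.items()}
print(dof_per_label)   # {5000: 18.0, 50000: 3.8}
```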
This is an interesting result through which we observe an effect similar to generalization, but here it is generalization from one part of a random training set to another part (also random). Because labels for all subsets of data are random, this result was not immediately expected. It would seem that training on some random labels forces the network to set up a base infrastructure for memorization, and that this infrastructure is then used to make further memorization more efficient. Future study of this phenomenon in greater depth could produce interesting work.
CIFAR-10 and ImageNet
In our paper, we spent a bit more time on CIFAR-10 and ImageNet, but here we will just briefly summarize the results.
CIFAR-10 is about 10 times harder than MNIST, with both FC and CNN models showing intrinsic dimensions about 10 times as large. Because the CIFAR-10 input dimension (3,072) is only about four times larger than that of MNIST (784), we can attribute the additional complexity to the classes in the dataset itself, e.g., images of airplanes showing greater variability than those of the digit 4.
Reinforcement learning
The subspace training approach can be applied just as easily in reinforcement learning scenarios. In fact, the random subspace approach herein was motivated by the results presented in the Evolution Strategies (ES) paper from OpenAI: that approximating gradients along a small number of dimensions works at all hinted that solution sets are more redundant than we might have imagined.
As we report in our paper, we trained networks using both ES and Deep Q-Networks (DQN) and found that humanoids can learn to walk in about 700 dimensions and that agents can play Atari Pong from pixels in 6,000 dimensions. So we might say these tasks require approximately the complexity of classifying MNIST and CIFAR-10, respectively. These relatively low dimensions may be why random search and gradient-free methods work as well as they do.
Finally, one last result we report in our paper is the pole balancing problem. This problem requires that an agent balance a pole by moving its base left and right. While this problem may be challenging for humans, since they have to react quickly enough to prevent the pole from falling over, for simulated agents not encumbered by slow reflexes, this problem is incredibly simple. In fact, it is solvable with a whopping intrinsic dimension of… four. This seems almost comically easy (~100 times easier than MNIST), and this simplicity is something we researchers should keep in mind when spending time on toy problems.
In this article, we have demonstrated a simple way of measuring intrinsic dimension, a property that provides much needed insight into the structure of the loss landscapes of neural networks and allows for more careful, quantitative comparison between models. We expect this measurement will be useful in a few directions. First, it provides researchers with a basic, order-of-magnitude understanding of the complexity of problems they work with. Second, researchers may build on this work, extending to the case of non-linear and non-uniform-random projections, which may provide tighter bounds on model MDL. Finally, we hope our learnings will inspire others to find approaches useful for choosing the best model or sub-model for their research. If this work interests you and you would like to learn more, check out our paper, use our code to measure your own networks, and if you would like to work on these sorts of research challenges with Uber AI Labs, apply for a role on our team.