We recently published an article announcing five papers on deep neuroevolution, including the discovery that genetic algorithms can solve deep reinforcement learning problems as well as popular alternatives, such as deep Q-learning and policy gradients. That work follows on Salimans et al. 2017, which showed the same for evolution strategies (ES), another neuroevolution algorithm. We further described how ES can be improved by adding exploration in the form of a pressure for agents to be novel, and how ES relates to gradient descent. All of that research was computationally expensive: It was conducted on between 720 and 3000 CPUs distributed across a large, high-performance computing cluster, seemingly putting deep neuroevolution out of reach for most researchers, students, companies, and hobbyists.

Today, we are releasing open source code that makes it possible to conduct such research much faster and cheaper. With this code, training deep neural networks to play Atari, which takes ~1 hour on 720 CPUs, takes ~4 hours on a single modern desktop. This point is important because it dramatically changes our perception of the resources required to conduct this kind of research, making it accessible to a much larger group of researchers.


Neuroevolution techniques are a competitive alternative for solving challenging deep reinforcement learning problems, such as Atari and humanoid locomotion. Shown are behaviors of deep neural networks trained with a simple genetic algorithm.


What changed to make it faster and work on only one computer?

It turns out that modern, high-end desktops, which have dozens of virtual cores, themselves act like a modest computing cluster. If evaluations are properly executed in parallel, a run that takes 1 hour on 720 cores can be run on the CPUs of a 48-core personal computer in 16 hours, which is slower, but not prohibitively so. Modern desktops also have GPUs, however, which are fast at running deep neural networks (DNNs). Our code maximizes the use of CPUs and GPUs in parallel. It runs deep neural networks on the GPU, the domains (e.g. video games or physics simulators) on the CPU, and executes multiple evaluations in parallel in a batch, allowing all available hardware to be utilized efficiently. As described below, it also contains custom TensorFlow operations, which significantly improve training speed.
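
The CPU-only estimate above can be checked with simple core-hour arithmetic. The sketch below assumes perfectly linear scaling, which is an idealization; the helper name `ideal_hours` is ours and does not come from the released code. Under that assumption the ideal figure is 15 hours, in line with the roughly 16 hours quoted above.

```python
def ideal_hours(source_cores, source_hours, target_cores):
    """Estimate run time on `target_cores`, assuming perfectly linear scaling."""
    core_hours = source_cores * source_hours  # total work, in core-hours
    return core_hours / target_cores

# Figures quoted in the post: 1 hour on 720 cores, moved to a 48-core desktop.
desktop_hours = ideal_hours(720, 1.0, 48)
print(desktop_hours)
```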

Enabling training on GPUs required a few modifications to how the neural network operations are computed. In our setup, running a single neural network was faster on a single CPU than on a GPU, but GPUs benefit greatly when similar computations (e.g. the forward passes of neural networks) are performed in parallel. To harness the GPU, we thus aggregate multiple neural network forward passes into batches. Doing so is common in neural network research, but usually involves the same neural network processing a batch of different inputs. Evolution, in contrast, operates on populations of different neural networks. Fortunately, the speedups occur even when the networks differ (though memory requirements increase). We implemented this population batching with basic TensorFlow operations and it yielded roughly a 2x speedup, reducing training time to around 8 hours. However, we realized we could do better. While TensorFlow provides all the operations needed, these operations are not tailored for this type of computation. We thus added two types of custom TensorFlow operations, which combined to yield another 2x speedup, reducing training to roughly 4 hours on a single machine, the figure quoted at the start of this post.
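
To make the idea of population batching concrete, here is a minimal sketch in NumPy rather than TensorFlow. It is an illustration under our own assumptions, not the released implementation: the two-layer network, the shapes, and the function names `forward_single` and `forward_population` are all hypothetical. Each network in the population has its own weights and sees only its own input, yet the whole population is evaluated in one batched call.

```python
import numpy as np

def forward_single(w1, w2, x):
    """One network, one input: a two-layer MLP with a ReLU hidden layer."""
    return np.maximum(x @ w1, 0.0) @ w2

def forward_population(w1, w2, xs):
    """Evaluate P different networks in one batched call.

    w1: (P, n_in, n_hidden), w2: (P, n_hidden, n_out), xs: (P, n_in).
    Network p is applied only to input p.
    """
    hidden = np.maximum(np.einsum("pi,pih->ph", xs, w1), 0.0)
    return np.einsum("ph,pho->po", hidden, w2)

rng = np.random.default_rng(0)
P, n_in, n_hidden, n_out = 8, 16, 32, 4  # hypothetical sizes
w1 = rng.standard_normal((P, n_in, n_hidden))
w2 = rng.standard_normal((P, n_hidden, n_out))
xs = rng.standard_normal((P, n_in))

batched = forward_population(w1, w2, xs)
# Reference result: loop over the population one network at a time.
looped = np.stack([forward_single(w1[p], w2[p], xs[p]) for p in range(P)])
```

Note that the weight tensors carry a leading population dimension, so memory grows with the number of networks held in the batch, which is the cost mentioned above.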

The first customized TensorFlow operation sped up the GPUs significantly. It is built specifically for heterogeneous neural network computation in RL domains where episodes are of different length, as is true in Atari and many simulated robot learning tasks. It allows the GPU to only run as many networks as need to be run, instead of requiring a fixed (large) set of networks to be run each iteration.
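
The benefit of running only the networks that still need to run can be seen with a simple counting model. The sketch below is not the custom operation itself; it only counts forward passes for a hypothetical set of episode lengths, comparing a fixed batch (every network runs until the longest episode ends) against an active-only batch.

```python
def count_forward_passes(episode_lengths):
    """Return (fixed, active_only) forward-pass counts for one generation."""
    population = len(episode_lengths)
    longest = max(episode_lengths)
    # Fixed batch: every network runs at every step until the longest episode ends.
    fixed = population * longest
    # Active-only batch: at each step, run only networks whose episode is still going.
    active_only = 0
    for step in range(longest):
        active = [i for i, length in enumerate(episode_lengths) if length > step]
        active_only += len(active)
    return fixed, active_only

# Hypothetical episode lengths for a population of four networks.
fixed, active_only = count_forward_passes([3, 10, 4, 7])
print(fixed, active_only)
```

The wider the spread of episode lengths, the more work the fixed batch wastes on episodes that have already finished.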

The improvements described so far made the GPUs more cost effective than CPUs. In fact, the GPUs were so fast that the Atari simulations (CPU) could not keep up, even when multiprocessing libraries were used to parallelize the computation. To improve simulation performance, we added a second set of custom TensorFlow operations. These changed the wrapper of the Atari simulation from Python to customized TensorFlow commands (reset, step, observation) that take advantage of the fast multithreading capabilities provided by TensorFlow without the typical slowdowns associated with Python and TensorFlow interacting. Overall these changes led to roughly a 3x speedup in the Atari simulator. These innovations should speed up any reinforcement learning research that has multiple instances of a domain (e.g. Atari or the MuJoCo physics simulator) running in parallel, which is an increasingly common technique in reinforcement learning, such as distributed deep Q-learning (DQN) and distributed policy gradients (e.g. A3C).
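
The shape of a batched simulator interface with reset, step, and observation commands can be sketched in plain Python. Everything here is hypothetical: `CounterEnv` is a toy stand-in for a real domain, and `BatchedEnv` is our own illustrative wrapper, not the custom TensorFlow operations. In CPython, a thread pool like this one gives little real parallelism for CPU-bound simulators because of the global interpreter lock, so the sketch shows only the interface, not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor

class CounterEnv:
    """Hypothetical toy domain: the state is a counter; the episode ends at `limit`."""
    def __init__(self, limit):
        self.limit = limit
        self.state = 0

    def reset(self):
        self.state = 0

    def step(self, action):
        self.state += action
        return float(action), self.state >= self.limit  # (reward, done)

    def observation(self):
        return self.state

class BatchedEnv:
    """Applies reset/step/observation to many environments by index."""
    def __init__(self, envs, workers=4):
        self.envs = envs
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def reset(self, indices):
        list(self.pool.map(lambda i: self.envs[i].reset(), indices))

    def step(self, indices, actions):
        # pool.map preserves input order, so results line up with `indices`.
        results = self.pool.map(
            lambda pair: self.envs[pair[0]].step(pair[1]), zip(indices, actions)
        )
        rewards, dones = zip(*results)
        return list(rewards), list(dones)

    def observation(self, indices):
        return list(self.pool.map(lambda i: self.envs[i].observation(), indices))

    def close(self):
        self.pool.shutdown()

batch = BatchedEnv([CounterEnv(limit=3), CounterEnv(limit=5)])
batch.reset([0, 1])
rewards, dones = batch.step([0, 1], [3, 1])
obs = batch.observation([0, 1])
batch.close()
```

Addressing environments by index lets the caller step only the subset whose episodes are still running.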

Once we could run a population of networks quickly on GPUs and faster domain simulators on the CPUs, the challenge became keeping all of the computer's resources busy as much as possible. If we did a forward pass on each neural network, asking it what action should be taken in the current state, then the CPUs running the game simulators would sit idle while the networks computed their answers. Similarly, if we then took the actions and asked the domain simulators “what states result from these actions?”, the GPUs running the neural networks would be idle during this simulation step. This is the “multithreaded CPU+GPU” option shown below. While an improvement over single-threaded computation, it is still inefficient.

A better solution is to have two or more subsets of neural networks paired with simulators, and keep the GPUs and CPUs running at the same time updating networks or simulations from different sets depending on which step (neural network or simulation) is ready to be taken. This approach is the rightmost, “pipelined CPU+GPU” option shown in the following figure. With it, and the other improvements mentioned above, we were able to get the training time for ~4M parameter neural networks down to the number mentioned above (~4 hours on a single computer).
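
The gain from pipelining can be illustrated with an idealized timing model. The sketch below is not the scheduler in the released code. It assumes, for illustration only, that the GPU and the CPU pool each serve one subset at a time and that the duration of a phase scales linearly with subset size; the function name `makespan` and the example timings are hypothetical.

```python
def makespan(num_groups, steps, gpu_time, cpu_time):
    """Finish time when `num_groups` equal subsets alternate GPU and CPU phases.

    gpu_time / cpu_time: time for the whole population to complete one
    forward pass / one simulation step. With num_groups=1 the GPU and CPUs
    take turns; with 2 or more groups their work can overlap.
    """
    g = gpu_time / num_groups  # assumed linear scaling with subset size
    c = cpu_time / num_groups
    gpu_free = cpu_free = 0.0
    ready = [0.0] * num_groups  # when each subset can start its next phase
    for _ in range(steps):
        for k in range(num_groups):  # network phase on the GPU
            start = max(ready[k], gpu_free)
            gpu_free = start + g
            ready[k] = gpu_free
        for k in range(num_groups):  # simulation phase on the CPUs
            start = max(ready[k], cpu_free)
            cpu_free = start + c
            ready[k] = cpu_free
    return max(ready)

alternating = makespan(num_groups=1, steps=3, gpu_time=2.0, cpu_time=2.0)
pipelined = makespan(num_groups=2, steps=3, gpu_time=2.0, cpu_time=2.0)
print(alternating, pipelined)
```

In this toy setting with balanced GPU and CPU phases, two subsets keep both resources busy almost all of the time, and the finish time approaches half that of the alternating schedule as the number of steps grows.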

Optimizing the scheduling of populations of heterogeneous networks in RL. The blue boxes are domain simulators, such as the Atari game emulator or physics engines like MuJoCo, which can have episodes of different lengths. A naive way to use a GPU (left) would result in low performance for two reasons: 1) a batch size of one for the GPU, which fails to take advantage of its parallel computation abilities, and 2) idle time while the GPU waits for the CPU and vice versa. A multithreaded approach (center) uses the GPU more efficiently by having multiple CPUs step the simulators in parallel, but leaves the GPU idle while the CPUs are working and vice versa. Our pipelined implementation (right) allows the GPU and CPU to operate efficiently. This approach also works with multiple GPUs and CPUs operating simultaneously, which is what we did in practice.

The impact of faster, cheaper experiments

Our code enables everyone in the research community, including students and self-taught learners, to iterate rapidly on experiments that train deep neural networks on challenging problems like Atari, which heretofore has been a luxury limited to well-funded industry and academic labs.

Faster code begets research advances. For example, our new code enabled us to launch an extensive hyperparameter search for the genetic algorithm at a fraction of the cost, which led to performance improvements on most Atari games over those we originally reported. We have updated our original arXiv publication with these new results. The faster code is also catalyzing all of our current research into improving deep neuroevolution by shortening our iteration times, enabling us to try each new idea on more domains and run the algorithms longer.

Our new software repository includes an implementation of our deep genetic algorithm, the evolution strategies algorithm from Salimans et al., and our (surprisingly competitive!) random search control. We hope others will use our code to accelerate their own research activities. We also invite the community to build off our code to improve it. For example, further speedups are possible with distributed GPU training and with adding other TensorFlow operations customized for this type of computation.

There is a lot of momentum building around deep neuroevolution. In addition to our work and that by OpenAI mentioned above, there have also been recent deep learning advances using evolutionary algorithms from DeepMind, Google Brain, and Sentient. We hope open sourcing our code contributes to this momentum by making the field more accessible.

Most generally, our goal is to lower the cost of doing this research to the point where researchers of all backgrounds can try their own ideas for improving deep neuroevolution and harness it to accomplish their goals.

To be notified of future Uber AI Labs blog posts, please sign up for our mailing list, or you can subscribe to the Uber AI Labs YouTube channel. If you are interested in joining Uber AI Labs, please apply at Uber.ai.  

Felipe Petroski Such
Felipe Petroski Such is a Research Scientist in Uber AI Labs.
Kenneth O. Stanley
Kenneth Stanley is a Senior Research Manager (Staff Scientist) at Uber AI Labs and a professor at the University of Central Florida.
Jeff Clune
Jeff Clune is a Senior Research Manager (Staff Scientist) with Uber AI Labs and an associate professor at the University of Wyoming.