Horovod v0.21: Optimizing Network Utilization with Local Gradient Aggregation and Grouped Allreduce

We originally open-sourced Horovod in 2017, and since then it has grown to become the standard solution in industry for scaling deep learning training to hundreds of GPUs.  With Horovod, you can reduce training times from days or weeks to hours or minutes by adding just a few lines of Python code to an existing TensorFlow, PyTorch, or Apache MXNet training script.

The Horovod community continues to grow, recently surpassing 10k stars on GitHub and reaching graduation status within the Linux Foundation for AI & Data. Today, we’re excited to announce v0.21, which brings a number of powerful new features to the Horovod community that make training deep learning models at scale faster and easier than ever before.

New features in v0.21 include:

  • Local gradient aggregation for TensorFlow v1 and v2 by Determined AI.
  • Grouped allreduce to reduce latency and improve determinism by Nvidia.
  • Support for easily provisioning Elastic Horovod jobs on Ray by Anyscale.
  • Support for Horovod Spark Estimators within the Databricks Runtime by Databricks.

Local Gradient Aggregation

Local gradient aggregation is a technique to reduce communication overhead in settings constrained by network bandwidth, where your GPUs process batches much faster than the network can transmit gradients for aggregation across workers. It works by accumulating gradient updates to the model locally in GPU memory, summing the latest updates with the updates accumulated since the last round of communication. After a configured number of mini-batches N, Horovod performs an allreduce to average gradients across workers over the network, then applies the updates to the model.

Local gradient aggregation is equivalent to using larger effective batch sizes. However, unlike directly increasing the batch size, the effective batch size increase when you use local gradient aggregation is not limited by the available GPU memory. Workers increase their amount of computation per unit of communication. In effect, local gradient aggregation both increases the batch size and reduces communication overhead by a factor of N. If communication overhead limits your model training performance, you can enable local gradient aggregation to increase throughput.
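To make the mechanism concrete, here is a minimal pure-Python sketch that simulates it without Horovod. The names `allreduce_average` and `train`, and the fixed made-up gradients, are illustrative assumptions rather than anything from Horovod itself. Each simulated worker sums its gradients locally and only every N mini-batches takes part in an averaged allreduce.

```python
def allreduce_average(per_worker):
    """Simulated allreduce: element-wise mean of one vector per worker."""
    num_workers = len(per_worker)
    return [sum(column) / num_workers for column in zip(*per_worker)]


def train(worker_grads, backward_passes_per_step, lr=0.5):
    """worker_grads[w][b] is the (fixed, made-up) gradient of worker w on mini-batch b."""
    num_workers = len(worker_grads)
    num_batches = len(worker_grads[0])
    dim = len(worker_grads[0][0])
    weights = [0.0] * dim
    local = [[0.0] * dim for _ in range(num_workers)]
    allreduce_calls = 0
    for b in range(num_batches):
        # Every worker adds its newest gradient to its local accumulator.
        for w in range(num_workers):
            local[w] = [acc + g for acc, g in zip(local[w], worker_grads[w][b])]
        # Communicate only once every `backward_passes_per_step` mini-batches.
        if (b + 1) % backward_passes_per_step == 0:
            averaged = allreduce_average(local)
            allreduce_calls += 1
            weights = [x - lr * g for x, g in zip(weights, averaged)]
            local = [[0.0] * dim for _ in range(num_workers)]
    return weights, allreduce_calls


grads = [
    [[1.0], [2.0], [3.0], [4.0]],  # worker 0
    [[3.0], [2.0], [1.0], [0.0]],  # worker 1
]
for n in (1, 2, 4):
    weights, calls = train(grads, backward_passes_per_step=n)
    print(f"N={n}: {calls} allreduce call(s), weights={weights}")
```

Because the toy gradients do not depend on the weights, all three runs end at the same weights and differ only in the number of allreduce calls. In real training the gradients change after every update, so a larger N behaves like a larger batch size rather than reproducing the N=1 trajectory.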

In the past, Horovod only supported local gradient aggregation using PyTorch. Now in v0.21, Determined AI has contributed full support for local gradient aggregation in TensorFlow, including v1 and v2, in eager and graph modes and in Keras.

You can enable local gradient aggregation by setting the backward_passes_per_step parameter to the desired aggregation frequency when you initialize the Horovod optimizer.
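As a hedged sketch of what this can look like with the Keras binding: the helper name `build_optimizer`, the choice of SGD, the learning rate, and the aggregation frequency of 4 are illustrative assumptions. Only `backward_passes_per_step` comes from the text above.

```python
def build_optimizer(backward_passes_per_step=4):
    """Wrap a Keras optimizer so gradients are aggregated locally before each allreduce."""
    # Imported lazily so this sketch can be loaded on machines without
    # TensorFlow or Horovod installed; calling the function requires both.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()
    base = tf.keras.optimizers.SGD(learning_rate=0.01)  # illustrative optimizer
    # Accumulate gradients for `backward_passes_per_step` mini-batches,
    # then allreduce across workers and apply the update.
    return hvd.DistributedOptimizer(
        base, backward_passes_per_step=backward_passes_per_step
    )
```

The returned optimizer would then be passed to `model.compile(...)` as usual.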

For more details on how local gradient aggregation works and how you can start using it with the Determined platform, see the full blog post by Determined AI, Optimizing Horovod with Local Gradient Aggregation.

Grouped Allreduce

The paper Exascale Deep Learning for Scientific Inverse Problems introduced the idea of Grouped Allreduce as a method to optimize Horovod’s performance when training at supercomputing scale.  In v0.21, Nvidia has contributed a general API to bring the performance improvements of Grouped Allreduce to the open source Horovod community. 

Grouped Allreduce gives users explicit control over how Horovod fuses (or “groups”) tensors for allreduce. Specifically, when you provide a list of tensors to hvd.grouped_allreduce, Horovod treats the list logically as a single request, and the backend will only process it when all tensors in the list are available.

This contrasts with Horovod’s normal approach to tensor fusion, where it batches small requests together into a single allreduce.  Without grouped allreduce, Horovod will greedily fuse any tensors ready to be allreduced as they become available. While this greedy fusing is appropriate in many situations, a number of circumstances can arise where you may want greater control over how this fusion is done.

One situation is when you want to reduce the latency of Horovod coordination and negotiation by lowering HOROVOD_CYCLE_TIME, but also want to ensure that fused allreduce messages do not become too small. This was not previously possible, as the fusion message sizes and cycle time were tightly coupled. By defining explicit groups, you are free to reduce the cycle time to as low a value as required for faster negotiation and coordination, while maintaining reasonable message sizes for network efficiency.
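The coupling between cycle time and message size can be illustrated with a toy model. This is a deliberately simplified sketch, not Horovod's scheduler: the function names, the integer millisecond timestamps, and the tensor sizes are all assumptions, and the fusion threshold is ignored. Greedy fusion sends whatever became ready during each cycle, while a group is sent as one message at the first cycle boundary after its last tensor is ready.

```python
def greedy_messages(ready_times, sizes, cycle_time):
    """Fuse all tensors that became ready within the same cycle into one message.

    Returns a list of (send_time, message_size) tuples.
    """
    buckets = {}
    for t, size in zip(ready_times, sizes):
        tick = -(-t // cycle_time)  # ceiling division: first boundary at or after t
        buckets[tick] = buckets.get(tick, 0) + size
    return [(tick * cycle_time, buckets[tick]) for tick in sorted(buckets)]


def grouped_messages(ready_times, sizes, groups, cycle_time):
    """Send each explicit group as one message once all of its tensors are ready."""
    messages = []
    for group in groups:
        last_ready = max(ready_times[i] for i in group)
        tick = -(-last_ready // cycle_time)
        messages.append((tick * cycle_time, sum(sizes[i] for i in group)))
    return sorted(messages)


ready_times = [1, 2, 3, 4, 5, 6, 7, 8]  # ms at which each gradient becomes ready
sizes = [4] * 8                          # MB per gradient tensor
groups = [[0, 1, 2, 3], [4, 5, 6, 7]]

for cycle in (4, 1):
    print("cycle", cycle, "greedy: ", greedy_messages(ready_times, sizes, cycle))
    print("cycle", cycle, "grouped:", grouped_messages(ready_times, sizes, groups, cycle))
```

In this model, shrinking the cycle from 4 ms to 1 ms splits the greedy messages into eight small ones, whereas the grouped messages keep their size regardless of the cycle time.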

A second situation is when you want deterministic operation from Horovod. Dynamic packing of the fusion buffer can cause allreduce results to be non-deterministic, because the location of tensors in the fusion buffer can affect summation order. Previously, the only way to get deterministic results from Horovod was to disable fusion completely with the setting HOROVOD_FUSION_THRESHOLD=0. However, this comes with an associated loss in throughput. Now, by explicitly defining the fusion groups and setting the environment variable HOROVOD_DISABLE_GROUP_FUSION, you can get deterministic fusion: the groupings guarantee that Horovod packs the fusion buffers in a deterministic order (both iteration to iteration and run to run). The setting prevents groups from fusing into larger groups, which would otherwise reintroduce non-determinism.
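The reason summation order matters is that floating-point addition is not associative. The following small sketch shows this in isolation, without Horovod. The helper name `reduce_in_order` and the sample values are illustrative assumptions chosen to make the effect obvious.

```python
def reduce_in_order(contributions, order):
    """Sum the contributions left to right in the given order."""
    total = 0.0
    for index in order:
        total += contributions[index]
    return total


contributions = [1e16, 1.0, -1e16]

first = reduce_in_order(contributions, [0, 1, 2])   # (1e16 + 1.0) - 1e16
second = reduce_in_order(contributions, [0, 2, 1])  # (1e16 - 1e16) + 1.0
print(first, second, first == second)
```

The same three values sum to different results depending on the order, which is why fixing the packing order of the fusion buffer is needed for reproducible allreduce results.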

You can enable Grouped Allreduce by setting the num_groups parameter of the DistributedOptimizer, or by directly calling hvd.grouped_allreduce on a list of tensors.
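Both options might look roughly like the following sketch, shown here with the PyTorch binding. The helper names, the default of two groups, and the choice of PyTorch are illustrative assumptions, and both helpers assume that `hvd.init()` has already been called.

```python
def build_grouped_optimizer(model, optimizer, num_groups=2):
    """Let Horovod split the model's gradients into `num_groups` allreduce groups."""
    # Imported lazily so the sketch loads without Horovod installed.
    import horovod.torch as hvd

    return hvd.DistributedOptimizer(
        optimizer,
        named_parameters=model.named_parameters(),
        num_groups=num_groups,
    )


def average_as_one_group(tensors):
    """Allreduce a list of tensors as a single logical request."""
    import horovod.torch as hvd

    # The backend processes the request only once every tensor in the list is ready.
    return hvd.grouped_allreduce(tensors, op=hvd.Average)
```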

By setting num_groups, Horovod automatically splits the list of gradient tensors into the requested number of groups. For more control over group placement, you can apply the hvd.grouped_allreduce function directly to a list of tensors.
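One plausible way to split a gradient list into a fixed number of groups is to cut it into contiguous, near-equal chunks. The sketch below illustrates that idea only. The function name `split_into_groups` is hypothetical, and Horovod's actual splitting strategy may differ.

```python
def split_into_groups(tensors, num_groups):
    """Split a list into `num_groups` contiguous chunks of near-equal length."""
    base, extra = divmod(len(tensors), num_groups)
    groups = []
    start = 0
    for i in range(num_groups):
        # The first `extra` groups receive one additional element.
        size = base + (1 if i < extra else 0)
        groups.append(tensors[start:start + size])
        start += size
    return groups


gradient_names = ["g0", "g1", "g2", "g3", "g4", "g5", "g6"]
print(split_into_groups(gradient_names, 3))
```

Because the split depends only on the order of the list and the number of groups, the same gradients land in the same group on every iteration.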

Elastic Horovod on Ray

In v0.20, we introduced Elastic Horovod, our auto-scaling and fault-tolerant API for TensorFlow and PyTorch. Now in v0.21, you can launch such jobs on preemptible cloud instances with just a few lines of code using Horovod on Ray’s Elastic Executor.

This API brings fault-tolerant, auto-scaling distributed training to your existing Ray cluster. Unlike traditional distributed training jobs, with the ElasticRayExecutor, you can safely train on a cluster of preemptible / spot instances that may come and go from the cluster throughout the training process.
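A launch script along these lines is one way this could look. Treat it as a sketch under assumptions: the wrapper name `run_elastic`, the resource settings, and the `address="auto"` connection to an already running Ray cluster are illustrative, and `training_fn` stands for your own Elastic Horovod training function.

```python
def run_elastic(training_fn):
    """Run `training_fn` with Elastic Horovod on an existing Ray cluster."""
    # Imported lazily so the sketch loads without Ray or Horovod installed.
    import ray
    from horovod.ray import ElasticRayExecutor

    ray.init(address="auto")  # attach to the running Ray cluster
    settings = ElasticRayExecutor.create_settings(verbose=True)
    executor = ElasticRayExecutor(settings, use_gpu=True, cpus_per_slot=2)
    executor.start()
    # Workers may join or leave while this call is in progress.
    return executor.run(training_fn)
```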

Horovod Spark Estimators on Databricks

Horovod Spark Estimators enable you to train a deep learning model with Horovod as part of any PySpark Pipeline. Now in v0.21.0, Databricks has added support for running Horovod Spark Estimators in the Databricks Runtime for Machine Learning environment (AWS | Azure).

To get started, just create a DBFSLocalStore where Databricks will store the intermediate training data.

The DBFSLocalStore uses Databricks File System (DBFS) local file APIs (AWS | Azure) as a store of intermediate data and training artifacts.
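As a rough sketch of how the store plugs into an Estimator: the helper name `build_estimator`, the DBFS path, and the batch size and epoch count are hypothetical placeholders, and the model, optimizer, loss, and column names are whatever your own pipeline uses.

```python
def build_estimator(model, optimizer, loss, feature_cols, label_cols,
                    work_dir="/dbfs/horovod_spark_estimator/demo"):
    """Create a Keras Estimator that keeps intermediate data in DBFS."""
    # Imported lazily so the sketch loads outside a Databricks/Spark environment.
    from horovod.spark.common.store import DBFSLocalStore
    from horovod.spark.keras import KerasEstimator

    # Store for intermediate training data and artifacts (hypothetical path).
    store = DBFSLocalStore(work_dir)
    return KerasEstimator(
        model=model,
        optimizer=optimizer,
        loss=loss,
        feature_cols=feature_cols,
        label_cols=label_cols,
        store=store,
        batch_size=32,  # illustrative values
        epochs=1,
    )
```

Calling `fit` on the returned estimator with a Spark DataFrame would then train the model as a stage of the pipeline.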

Apache Spark 3.0 added GPU-aware scheduling to Spark, and with Databricks Runtime 7.0 ML GPU and above this capability comes pre-configured. See GPU scheduling instructions (AWS | Azure) for details.

With the Estimator API, Horovod launches as many tasks on each worker as that worker has GPUs, and each task is pinned to a single GPU assigned by Spark.

For more fine-grained control, Horovod also offers the Run API, so you can train with a lambda function specifying the training logic. With this API, the function get_available_devices() from horovod.spark.task returns a list of assigned GPUs for the Spark task from which it is called. See keras_spark3_rossmann.py for an example of using get_available_devices() with the Run API.
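A minimal sketch of the Run API with `get_available_devices()` might look as follows. The function names `train_fn` and `launch` and the process count are illustrative assumptions, and the sketch assumes an active Spark session with GPU-aware scheduling configured. A real training function would go on to pin its framework to the returned device.

```python
def train_fn():
    """Runs inside each Spark task; reports the GPUs Spark assigned to it."""
    # Imported lazily so the sketch loads without Horovod or Spark installed.
    from horovod.spark.task import get_available_devices

    devices = get_available_devices()
    # Training logic would go here, using devices[0] as this task's GPU.
    return devices


def launch(num_proc=2):
    """Run `train_fn` in `num_proc` Horovod processes on the Spark cluster."""
    import horovod.spark

    # Returns one result per Horovod process.
    return horovod.spark.run(train_fn, num_proc=num_proc)
```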

What’s Next?

Our next major milestone for the Horovod project is v1.0, solidifying the core API and the newly introduced Elastic Horovod API.  As part of this effort, we’ll be focusing on the following priorities:

  • Higher level API to simplify Elastic training and data loading.
  • Feature parity between all supported frameworks: TensorFlow, PyTorch, and MXNet.
  • Improved error handling, messaging, and debuggability.
  • Slow worker mitigation / removal with Elastic Horovod.

Additionally, in the coming weeks we’ll be talking more about our recent collaboration with Anyscale on Distributed Hyperparameter Search using Ray Tune and our plans to expand upon this work in the near future.

To learn more, be sure to check out Horovod on GitHub and the full release notes for v0.21.0. We look forward to seeing you there!

Acknowledgements 

We want to thank our amazing community of individual and corporate contributors, including:

  • Aaron Harlap and Neil Conway from Determined AI, for their work on Local Gradient Aggregation.
  • Josh Romero from Nvidia, for his work on Grouped Allreduce and other improvements to Horovod’s performance.
  • Richard Liaw from Anyscale, for his work on Horovod on Ray, including Elastic Horovod on Ray and Ray Tune integration.
  • Liang Zhang and Xiangrui Meng from Databricks, for their work on integrating Horovod into the Databricks Runtime.
  • Enrico Minack from G-Research, for his work on improving Spark integration and Horovod’s continuous integration system.
  • Leonard Lausen from Amazon, for his work on supporting MXNet integration and improving the Horovod build system.
  • Nicolas Castet from Intel, for his work on migrating the Horovod build system to CMake.

We would also like to thank the Linux Foundation for continuing to support the Horovod project, and AWS for providing credits to support our continuous integration system.
