Using the Horovod distributed learning library in Azure Databricks
Horovod is a library for distributed deep learning training. It supports commonly used frameworks such as TensorFlow, Keras, and PyTorch. As mentioned before, it is based on the tensorflow-allreduce library and implements the ring allreduce algorithm to ease the migration from single-GPU (graphics processing unit) training to distributed training on multiple GPUs in parallel.
To do this, we adapt the single-GPU training script of a deep learning model so that it uses the horovod library during training. Once adapted, the script can run on a single GPU or on multiple GPUs without further changes to the code.
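As a rough illustration of what such an adaptation can look like, the sketch below wraps a small Keras training routine in a function and adds the usual Horovod steps: initialize, pin one GPU per process, shard the data, wrap the optimizer, and broadcast the initial weights. The model, the MNIST dataset, the hyperparameter values, and the function names train_hvd and launch are assumptions made for this example, not part of any original script. The launch helper assumes the HorovodRunner API that Databricks ML runtimes expose through sparkdl, and the exact behavior may vary with the TensorFlow and Horovod versions installed on the cluster.

```python
def train_hvd(learning_rate=0.001, batch_size=64, epochs=1):
    """Hypothetical single-GPU Keras script adapted for Horovod."""
    # Imports live inside the function so they are resolved on each worker.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    # 1. Start Horovod; every process learns its rank and the world size.
    hvd.init()

    # 2. Pin each process to one local GPU (skipped on CPU-only nodes).
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # 3. Give every worker a different shard of the training data.
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x[..., None].astype("float32") / 255.0
    dataset = (
        tf.data.Dataset.from_tensor_slices((x, y))
        .shard(hvd.size(), hvd.rank())
        .shuffle(10000)
        .batch(batch_size)
        .repeat()
    )
    # Same number of steps on every worker, so no process waits forever.
    steps_per_epoch = len(x) // (batch_size * hvd.size())

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # 4. Scale the learning rate by the number of workers (a common
    #    heuristic, assumed here) and wrap the optimizer so gradients
    #    are averaged across workers with allreduce.
    optimizer = tf.keras.optimizers.Adam(learning_rate * hvd.size())
    optimizer = hvd.DistributedOptimizer(optimizer)
    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    # 5. Broadcast rank 0's initial weights so all workers start identically.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    model.fit(
        dataset,
        steps_per_epoch=steps_per_epoch,
        epochs=epochs,
        callbacks=callbacks,
        verbose=1 if hvd.rank() == 0 else 0,  # log from one worker only
    )


def launch(num_procs=2):
    """Run train_hvd with HorovodRunner (assumes a Databricks ML runtime)."""
    from sparkdl import HorovodRunner

    runner = HorovodRunner(np=num_procs)
    return runner.run(train_hvd, learning_rate=0.001)
```

In a Databricks notebook, calling launch(num_procs=-1) would be expected to run a single process on the driver, which is convenient for debugging, while a positive value distributes the same function across that many worker processes.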
Horovod uses a data-parallel strategy: it distributes the training efficiently across multiple GPUs that work in parallel, and it implements the ring allreduce algorithm to overcome communication limitations between them.
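To make the ring allreduce idea more concrete, here is a minimal single-process simulation in plain Python. It is only a sketch of the general algorithm, not Horovod's implementation, which relies on native communication backends. The function names ring_allreduce and allreduce_average, and the use of Python lists as stand-ins for gradient tensors, are assumptions made for illustration. Each worker talks only to its neighbor in the ring, first accumulating partial sums chunk by chunk and then circulating the finished chunks.

```python
def ring_allreduce(worker_grads):
    """Simulate a sum ring allreduce over equally sized gradient lists."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same length")

    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [c * length // n for c in range(n + 1)]
    # Work on copies so the callers' lists are left untouched.
    bufs = [list(g) for g in worker_grads]

    def chunk(buf, c):
        return buf[bounds[c]:bounds[c + 1]]

    # Phase 1, scatter-reduce: in each step, worker i sends one chunk to
    # worker i + 1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot outgoing data first: all sends in a step are simultaneous.
        outgoing = [chunk(bufs[i], (i - step) % n) for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            lo = bounds[(i - step) % n]
            for k, value in enumerate(outgoing[i]):
                bufs[dst][lo + k] += value
    # Worker i now holds the fully summed chunk (i + 1) % n.

    # Phase 2, allgather: finished chunks travel around the ring and
    # overwrite the partial values held by each receiver.
    for step in range(n - 1):
        outgoing = [chunk(bufs[i], (i + 1 - step) % n) for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            c = (i + 1 - step) % n
            bufs[dst][bounds[c]:bounds[c + 1]] = outgoing[i]
    return bufs


def allreduce_average(worker_grads):
    """Average gradients across workers, as done in data-parallel training."""
    n = len(worker_grads)
    return [[v / n for v in buf] for buf in ring_allreduce(worker_grads)]


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    # Every worker ends with the same summed vector: [111, 222, 333, 444]
    for rank, buf in enumerate(ring_allreduce(grads)):
        print(rank, buf)
```

With n workers, each phase takes n - 1 steps and each worker sends only one chunk per step, so the data sent per worker stays roughly constant as workers are added. This is the property that helps with the communication limitations mentioned above.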
Horovod is implemented so that each GPU gets a mini-batch of data...