Deep Learning At Scale: Parallel Model Training


Concept and a PyTorch Lightning example

Caroline Arnold | Towards Data Science | April 2024
[Figure: Eight parallel neon bulbs in rainbow colors against a dark background. Image created by the author using Midjourney.]

Parallel training on a large number of GPUs is the state of the art in deep learning. The open-source image generation model Stable Diffusion was trained on a cluster of 256 GPUs. Meta’s AI Research SuperCluster contains more than 24,000 NVIDIA H100 GPUs that are used to train models such as Llama 3.

By using multiple GPUs, machine learning experts reduce the wall time of their training runs. Training Stable Diffusion took 150,000 GPU hours, more than 17 years of compute on a single GPU. Spread across 256 GPUs, that works out to roughly 590 hours per device, so parallel training cut the wall time to about 25 days.

There are two types of parallel deep learning:

  • Data parallelism, where a large dataset is distributed across multiple GPUs.
  • Model parallelism, where a deep learning model that is too large to fit on a single GPU is distributed across multiple devices.

We will focus here on data parallelism, since model parallelism only becomes relevant for very large models with more than roughly 500 million parameters.
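To make the concept concrete, here is a minimal sketch of what data-parallel training looks like in PyTorch Lightning. The model, dataset, and hyperparameters below are placeholders chosen for illustration; the part that matters is the Trainer configuration, which requests four GPUs on one node and the DDP (DistributedDataParallel) strategy. It assumes Lightning 2.x.

```python
# Minimal data-parallel training sketch (assumed setup: 1 node, 4 GPUs,
# Lightning >= 2.0). Model, data, and hyperparameters are placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning as L


class LitClassifier(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)  # toy model standing in for a real network

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    # Random stand-in dataset: 1024 samples, 32 features, 2 classes.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    loader = DataLoader(dataset, batch_size=64, num_workers=2)

    # Data parallelism: one process per GPU, each holding a full model copy.
    # Lightning shards the dataset across processes and averages gradients.
    trainer = L.Trainer(
        accelerator="gpu",
        devices=4,        # number of GPUs on this node
        strategy="ddp",   # data parallelism via DistributedDataParallel
        max_epochs=5,
    )
    trainer.fit(LitClassifier(), train_dataloaders=loader)
```

With the DDP strategy, Lightning launches one process per GPU, keeps a full model replica on each device, hands every process a different shard of the data via a distributed sampler, and averages gradients across replicas after each backward pass, so the effective batch size grows with the number of GPUs.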

Beyond reducing wall time, there is an economic argument for parallel training: Cloud compute providers such as AWS offer single machines with up to 16 GPUs. Parallel training can take advantage of all available GPUs, and you get more value for your money.
