Deep Learning At Scale: Parallel Model Training | by Caroline Arnold | Apr, 2024

Concept and a Pytorch Lightning example

Eight parallel neon bulbs in rainbow colors against a dark background. — Image created by the author using Midjourney.

Parallel training on a large number of GPUs is state of the art in deep learning. The open source image generation algorithm Stable Diffusion was trained on a cluster of 256 GPUs. Meta’s AI Research SuperCluster contains more than 24,000 NVIDIA H100 GPUs that are used to train models such as Llama 3.

By using multiple GPUs, machine learning experts reduce the wall time of their training runs. Training Stable Diffusion took 150,000 GPU hours, or more than 17 years. Parallel training reduced that to 25 days.

There are two types of parallel deep learning:

Data parallelism, where a large dataset is distributed across multiple GPUs.
Model parallelism, where a deep learning model that is too large to fit on a single GPU is distributed across multiple devices.

We will focus here on data parallelism, as model parallelism only becomes relevant for very large models beyond 500M parameters.

Beyond reducing wall time, there is an economic argument for parallel training: Cloud compute providers such as AWS offer single machines with up to 16 GPUs. Parallel training can take advantage of all available GPUs, and you get more value for your money.

Deep Learning At Scale: Parallel Model Training | by Caroline Arnold | Apr, 2024

Concept and a Pytorch Lightning example

Recent Articles

Using Machine Learning in Customer Segmentation

NYT ‘Connections’ hints and answers for July 27: Tips to solve ‘Connections’ #412.

Crooks Bypassed Google’s Email Verification to Create Workspace Accounts, Access 3rd-Party Services – Krebs on Security

🤖 The AI Developer’s Toolkit: Essential Skills and Resources [2023 Edition] 🔧 | by Jett Black | Jul, 2024

Find answers accurately and quickly using Amazon Q Business with the SharePoint Online connector

Related Stories

Leave A Reply Cancel reply