TLDR

  • Want large models, large batch sizes, and fast training, so data, model, and optimizer state need to be distributed across devices / nodes
  • Communication overhead and pipeline bubbles become an issue

See ml-engineering/training/model-parallelism for a more detailed breakdown

Parallelism Types

Data Parallel

Replicate the model on each device, split each batch into sub-batches (one per device), and synchronize gradients at each step.

Works great for smaller models that fit on a single device (ex: Vision Models)
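A minimal data-parallel sketch using PyTorch DistributedDataParallel (assumes a `torchrun --nproc_per_node=N` launch and one GPU per process; the model and loss are placeholders):

```python
# Data-parallel sketch; launch with `torchrun --nproc_per_node=N train.py`.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")                 # NCCL for GPU collectives
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = DDP(torch.nn.Linear(1024, 1024).cuda())          # placeholder model; grads are all-reduced
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")             # each rank sees its own sub-batch
        loss = model(x).pow(2).mean()                         # placeholder loss
        loss.backward()                                       # DDP overlaps the all-reduce with backward
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```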

Model Parallel

Tensor Parallel

Split individual weight tensors (e.g. the attention and MLP matrices) into chunks across devices; each device computes a partial result that is combined with collectives.
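A rough sketch of a column-parallel linear layer in this style (illustrative, not the Megatron API; assumes an initialized process group and `out_features` divisible by the world size):

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each rank holds a column shard of the weight; the output is re-assembled with an all-gather."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0, "out_features must divide evenly across ranks"
        self.shard = torch.nn.Linear(in_features, out_features // world, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.shard(x)                                   # (batch, out_features / world)
        parts = [torch.empty_like(local) for _ in range(dist.get_world_size())]
        # Forward-only sketch: a trainable version needs an autograd-aware gather
        # (e.g. torch.distributed.nn.functional.all_gather).
        dist.all_gather(parts, local)
        return torch.cat(parts, dim=-1)                         # full (batch, out_features) output
```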

Pipeline Parallelism

Split model by layer, with each device holding a set of consecutive layers.

Naive pipeline parallelism leaves most devices idle (bubbles), so each batch is split into microbatches that are fed through the pipeline in a staggered schedule, letting all devices work in parallel.
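A toy single-process illustration of GPipe-style microbatching (the stages are placeholder layers; a real pipeline puts each stage on its own device and exchanges activations with point-to-point send / receive):

```python
import torch

# Two placeholder pipeline stages; in a real setup each stage lives on its own device.
stage0 = torch.nn.Linear(512, 512)
stage1 = torch.nn.Linear(512, 512)

batch = torch.randn(64, 512)
microbatches = batch.chunk(8)                    # 8 microbatches instead of 1 big batch

# GPipe-style idea: while stage1 works on microbatch i, stage0 can already start i+1,
# shrinking the bubble. On a single process the loops simply run back-to-back.
hidden  = [stage0(mb) for mb in microbatches]
outputs = [stage1(h) for h in hidden]

loss = torch.cat(outputs).mean()                 # placeholder loss over all microbatches
loss.backward()                                  # real schedules also run backward per microbatch
```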

Sequence Parallelism

For sequence models, split the input along the sequence dimension and process the sub-sequences on different devices.
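A forward-only sketch of the data movement (each rank handles a slice of the sequence; real implementations also have to handle attention's dependencies across the whole sequence):

```python
import torch
import torch.distributed as dist

def sequence_parallel_forward(x: torch.Tensor, block: torch.nn.Module) -> torch.Tensor:
    """x: (batch, seq_len, hidden); seq_len assumed divisible by the world size."""
    world, rank = dist.get_world_size(), dist.get_rank()
    chunk = x.chunk(world, dim=1)[rank]                     # this rank's sub-sequence
    out = block(chunk)                                      # position-wise work (MLPs, norms) stays local
    gathered = [torch.empty_like(out) for _ in range(world)]
    dist.all_gather(gathered, out)                          # forward-only re-assembly of the sequence
    return torch.cat(gathered, dim=1)
```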

Expert Parallelism for MoEs

Put a subset of the experts on each device; see Expert Parallelism (EP)
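A rough sketch of the token exchange behind expert parallelism, with one expert per rank and top-1 routing (it assumes every rank sends the same number of tokens to every expert, which real routers do not guarantee):

```python
import torch
import torch.distributed as dist

def route_to_experts(tokens: torch.Tensor, router_logits: torch.Tensor,
                     local_expert: torch.nn.Module) -> torch.Tensor:
    """tokens: (n_tokens, hidden); one expert per rank; top-1 routing for simplicity."""
    world = dist.get_world_size()
    dest = router_logits.argmax(dim=-1)                       # expert (= rank) chosen per token
    # Bucket tokens by destination rank; equal bucket sizes assumed everywhere.
    send = [tokens[dest == e].contiguous() for e in range(world)]
    recv = [torch.empty_like(s) for s in send]
    dist.all_to_all(recv, send)                               # exchange tokens between ranks
    out = local_expert(torch.cat(recv))                       # run this rank's expert on its tokens
    # A second all-to-all (omitted here) would send the outputs back to the source ranks.
    return out
```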

Combinations

3D Parallelism (Data, Pipeline and Tensor)

Saving Memory

Activation Checkpointing

Gradient Accumulation

CPU Offloading

Quantization, Compression and Mixed Precision

Optimizers

1-bit Adam

Communication

Primitives

AllReduce

all-gather, all-reduce, broadcast, reduce, reduce-scatter as well as point-to-point send and receive
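A quick sketch of these primitives through torch.distributed (assumes an initialized process group; with NCCL the tensors must live on the GPU, gloo works with CPU tensors):

```python
import torch
import torch.distributed as dist

# Assumes the process group is already initialized (e.g. torchrun + init_process_group).
def demo_collectives():
    rank, world = dist.get_rank(), dist.get_world_size()
    t = torch.ones(4) * rank

    dist.all_reduce(t, op=dist.ReduceOp.SUM)          # every rank ends up with the global sum
    dist.broadcast(t, src=0)                          # copy rank 0's tensor to all ranks

    gathered = [torch.empty(4) for _ in range(world)]
    dist.all_gather(gathered, t)                      # every rank receives every rank's tensor
```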

Libraries / Backends

NCCL

MPI

Gloo

Networking

InfiniBand

Fault Tolerance

Distributed Training Frameworks

Torch FSDP (Fully Sharded Data Parallel) and FSDP2

Device Mesh
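A minimal FSDP sketch using the stable FSDP1 wrapper (FSDP2's fully_shard / DeviceMesh API differs across torch versions, so treat this as illustrative):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launch with torchrun; each rank keeps only a shard of the params, grads and optimizer state.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()

model = FSDP(model)                                    # parameters get sharded across ranks
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)   # optimizer state is sharded too

x = torch.randn(8, 4096, device="cuda")
loss = model(x).mean()
loss.backward()          # params are all-gathered per layer, grads reduce-scattered back to shards
opt.step()
```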

DeepSpeed

ZeRO1

ZeRO2

ZeRO3

  1. Most memory is taken up by optimizer state (the Adam moments are kept in full precision)
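Rough per-parameter memory arithmetic for mixed-precision Adam training, which is what the ZeRO stages partition (standard ZeRO-paper accounting; exact numbers depend on the setup):

```python
# Per-parameter bytes for mixed-precision Adam (the usual ZeRO accounting):
fp16_weights  = 2
fp16_grads    = 2
fp32_master   = 4      # full-precision copy of the weights
fp32_moment_1 = 4      # Adam first moment
fp32_moment_2 = 4      # Adam second moment

total = fp16_weights + fp16_grads + fp32_master + fp32_moment_1 + fp32_moment_2
print(total)           # 16 bytes / param  ->  ~16 GB per billion parameters on a single device

# ZeRO partitions these across the data-parallel ranks:
#   ZeRO-1: shards the optimizer state (the 12 fp32 bytes)
#   ZeRO-2: additionally shards the fp16 gradients
#   ZeRO-3: additionally shards the fp16 parameters themselves
```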

ZeRO++

Megatron

FairScale

Ray Train

TorchTitan

Decentralized Multi-Datacenter / Federated Training / Asynchronous Training

Async SGD

Links