Data Parallelism

Model Parallelism

Tensor Parallelism

Split Tensors

Pipeline Parallelism

Split by layer

Expert Parallelism for MoEs

Activation Checkpointing

Gradient Accumulation

CPU Offloading

Torch FSDP (Fully Sharded Data Parallel)

Device Mesh

DeepSpeed

ZeRO1

ZeRO2

ZeRO3

  1. Most model-state memory is taken up by optimizer states (with mixed-precision Adam, the fp32 master weights and both moment estimates must be stored in full precision)
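
A rough back-of-the-envelope sketch of the note above, assuming mixed-precision Adam with the usual byte counts (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for the fp32 master weights plus the two Adam moments). The function name `zero_model_state_gb` and the 7.5B-parameter / 64-GPU figures are illustrative choices, not part of any library; activations, buffers and fragmentation are ignored.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB (1 GB = 1e9 bytes).

    stage 0 = plain data parallel, 1/2/3 = ZeRO stages.
    Assumes mixed-precision Adam; hypothetical helper for illustration only.
    """
    param_bytes = 2 * num_params    # fp16 weights
    grad_bytes = 2 * num_params     # fp16 gradients
    optim_bytes = 12 * num_params   # fp32 master weights + momentum + variance (4 + 4 + 4)

    if stage >= 1:                  # ZeRO-1: shard optimizer states
        optim_bytes /= num_gpus
    if stage >= 2:                  # ZeRO-2: also shard gradients
        grad_bytes /= num_gpus
    if stage >= 3:                  # ZeRO-3: also shard parameters
        param_bytes /= num_gpus

    return (param_bytes + grad_bytes + optim_bytes) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_model_state_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.3f} GB per GPU")
```

Under these assumptions the optimizer states account for 12 of the 16 bytes per parameter (75%) without sharding, which is why ZeRO-1 alone already removes most of the per-GPU footprint.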

Multi Datacenter / Federated

Async SGD