TLDR
- Want large models, large batch sizes and fast training, so data, model parameters and optimizer state need to be distributed across devices / nodes
- Communication overhead and pipeline bubbles then become issues
See ml-engineering/training/model-parallelism for a more detailed breakdown
Parallelism Types
Data Parallel
Replicate the model on each device, split each batch into sub-batches (one per device), and synchronize gradients at each step.
Works great for smaller models that fit on a single device (e.g. vision models)
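A minimal data-parallel sketch using PyTorch DistributedDataParallel; the model, batch shapes and hyperparameters are placeholders and a `torchrun --nproc_per_node=<gpus>` launch is assumed.

```python
# Minimal DDP sketch: one process per GPU, gradients all-reduced each step.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # torchrun sets rank/world size
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Linear(1024, 1024).cuda(rank)   # placeholder model
model = DDP(model, device_ids=[rank])            # replicate + sync gradients
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    x = torch.randn(32, 1024, device=f"cuda:{rank}")  # each rank's sub-batch
    loss = model(x).pow(2).mean()
    loss.backward()                               # DDP all-reduces grads here
    opt.step()
    opt.zero_grad()

dist.destroy_process_group()
```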
Model Parallel
Tensor Parallel
Split individual weight tensors (e.g. big matmuls inside a layer) into chunks across devices; each device computes a partial result and collectives combine them.
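A single-process illustration of column-wise tensor parallelism for a linear layer; in a real setup each weight shard lives on a different GPU and the partial outputs are combined with an all-gather (column split) or all-reduce (row split).

```python
# Column-parallel linear layer, simulated on one process.
import torch

d_in, d_out, world_size = 512, 2048, 4
x = torch.randn(8, d_in)
W = torch.randn(d_in, d_out)

shards = W.chunk(world_size, dim=1)          # one column shard per "device"
partials = [x @ w for w in shards]           # each device computes its slice
y_parallel = torch.cat(partials, dim=1)      # all-gather along features

assert torch.allclose(y_parallel, x @ W, atol=1e-5)
```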
Pipeline Parallelism
Split model by layer, with each device holding a set of consecutive layers.
Naive pipeline parallelism leaves devices idle (bubbles), so each batch is split into microbatches that stream through the stages, letting all devices work in parallel on different microbatches (see the sketch after the links below).
- Introducing GPipe, an Open Source Library for Efficiently Training Large-scale Neural Network Models
- [2401.10241] Zero Bubble Pipeline Parallelism
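A single-process sketch of GPipe-style microbatching (all-forward-all-backward); in a real pipeline each stage sits on a different device and activations are sent between ranks, this just shows the stage split and the microbatch loop.

```python
# GPipe-style schedule, simulated locally: split layers into stages, split the
# batch into microbatches, accumulate gradients over all microbatches.
import torch
import torch.nn as nn

layers = [nn.Linear(256, 256) for _ in range(8)]
num_stages, num_microbatches = 4, 8
per_stage = len(layers) // num_stages
stages = [nn.Sequential(*layers[i * per_stage:(i + 1) * per_stage])
          for i in range(num_stages)]

batch = torch.randn(64, 256)
microbatches = batch.chunk(num_microbatches)

# With real devices, stage s can start microbatch m+1 while stage s+1 is still
# working on microbatch m; that overlap is what shrinks the pipeline bubble.
outputs = []
for mb in microbatches:
    act = mb
    for stage in stages:
        act = stage(act)
    outputs.append(act)

loss = torch.stack([o.pow(2).mean() for o in outputs]).mean()
loss.backward()      # all-backward: step the optimizer once per full batch
```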
Sequence Parallelism
For sequence models, split along the sequence dimension and process subsequences on different devices.
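A single-process illustration of the idea for a position-wise block; each sequence chunk would live on a different rank, position-wise ops (MLP, layer norm) need no communication, while attention would require exchanging keys/values (e.g. all-gather or ring communication).

```python
# Split the sequence dimension across "devices" for a position-wise MLP.
import torch
import torch.nn as nn

world_size, seq_len, d_model = 4, 1024, 256
x = torch.randn(2, seq_len, d_model)               # (batch, sequence, hidden)
mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))

chunks = x.chunk(world_size, dim=1)                # one subsequence per rank
outputs = [mlp(c) for c in chunks]                 # processed independently
y = torch.cat(outputs, dim=1)                      # all-gather would reassemble

assert torch.allclose(y, mlp(x), atol=1e-5)
```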
Expert Parallelism for MoEs
Put a subset of experts on each device and route tokens to them, see Expert Parallelism (EP)
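A single-process sketch of MoE routing with top-1 gating; in real expert parallelism each expert lives on a different rank and tokens are exchanged with an all-to-all before and after the expert computation. All module sizes here are placeholders.

```python
# Top-1 MoE routing, simulated locally.
import torch
import torch.nn as nn

num_experts, d_model = 4, 256
experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_experts)])
router = nn.Linear(d_model, num_experts)

tokens = torch.randn(32, d_model)
expert_ids = router(tokens).argmax(dim=-1)     # top-1 gating decision per token

out = torch.zeros_like(tokens)
for e, expert in enumerate(experts):
    mask = expert_ids == e                     # tokens routed to expert e
    out[mask] = expert(tokens[mask])           # would run on expert e's device
```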
Combinations
3D Parallelism (Data, Pipeline and Tensor)
Saving Memory
Activation Checkpointing
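A small sketch with torch.utils.checkpoint: activations inside the checkpointed block are not stored and get recomputed during backward, trading compute for memory. The block and shapes are placeholders.

```python
# Activation checkpointing: recompute the block's activations in backward.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # no intermediate activations kept
y.sum().backward()
```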
Gradient Accumulation
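A sketch of gradient accumulation: simulate a large batch by summing gradients over several small sub-batches before each optimizer step. Model and batch sizes are placeholders.

```python
# Accumulate gradients over accum_steps sub-batches per optimizer step.
import torch

model = torch.nn.Linear(1024, 1024)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8

for step, x in enumerate(torch.randn(64, 16, 1024)):   # 64 small sub-batches
    loss = model(x).pow(2).mean() / accum_steps          # scale to match the mean
    loss.backward()                                       # grads sum into .grad
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()
```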
CPU Offloading
Quantization, Compression and Mixed Precision
Optimizers
1-bit Adam
Communication
Primitives
- Point-to-point communication — NCCL 2.23.4 documentation
- Collective Operations — NCCL 2.23.4 documentation
AllReduce
The core collectives are all-gather, all-reduce, broadcast, reduce and reduce-scatter; point-to-point send and receive are also available.
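A minimal all-reduce sketch with torch.distributed (gloo backend so it runs on CPU), assuming a `torchrun --nproc_per_node=4` launch; after the call every rank holds the sum of all ranks' tensors, which is exactly how data parallelism averages gradients.

```python
# All-reduce: every rank ends up with the element-wise sum across ranks.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
rank, world_size = dist.get_rank(), dist.get_world_size()

t = torch.full((4,), float(rank))            # each rank contributes its own values
dist.all_reduce(t, op=dist.ReduceOp.SUM)     # in-place sum across all ranks
t /= world_size                              # average, as done for gradients

print(f"rank {rank}: {t.tolist()}")
dist.destroy_process_group()
```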
Libraries / Backends
NCCL
MPI
Gloo
Networking
NVLink
InfiniBand
Fault Tolerance
Distributed Training Frameworks
Torch FSDP (Fully Sharded Data Parallel) and FSDP2
- [2304.11277] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- How Fully Sharded Data Parallel (FSDP) works? - YouTube
- Lecture 12 (Part2): Maximize GPU Utilization - YouTube
- torchtitan/docs/fsdp.md at main · pytorch/torchtitan · GitHub
Device Mesh
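A minimal FSDP sketch: wrapping the model shards parameters, gradients and optimizer state across data-parallel ranks and gathers parameters on the fly for each forward/backward. Assumes a torchrun launch with one process per GPU; the model is a placeholder, and recent PyTorch versions can also take a DeviceMesh to describe multi-dimensional layouts.

```python
# Wrap a model in FSDP so model and optimizer state are sharded across ranks.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)]).cuda(rank)
model = FSDP(model)                              # shards params/grads/optimizer state
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(16, 1024, device=f"cuda:{rank}")
loss = model(x).pow(2).mean()
loss.backward()
opt.step()
```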
DeepSpeed
- Training Overview and Features - DeepSpeed
- Latest News - DeepSpeed
- Stanford CS25: V4 I From Large Language Models to Large Multimodal Models - YouTube
- DeepSpeed: Extreme-scale model training for everyone - Microsoft Research
ZeRO-1 (shards optimizer states across data-parallel ranks)
ZeRO-2 (also shards gradients)
ZeRO-3 (also shards parameters)
- Most memory is taken up by optimizer states (Adam's moments are stored in full precision)
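A back-of-envelope calculation following the ZeRO paper's mixed-precision Adam accounting (2 bytes fp16 params + 2 bytes fp16 grads + 12 bytes fp32 master params, momentum and variance = 16 bytes per parameter); the model size and GPU count below are just illustrative.

```python
# Rough per-GPU memory for model states under the ZeRO stages.
def bytes_per_param(stage: int, n_ranks: int) -> float:
    params, grads, opt_state = 2.0, 2.0, 12.0
    if stage >= 1:                 # ZeRO-1: shard optimizer states
        opt_state /= n_ranks
    if stage >= 2:                 # ZeRO-2: also shard gradients
        grads /= n_ranks
    if stage >= 3:                 # ZeRO-3: also shard parameters
        params /= n_ranks
    return params + grads + opt_state

# Example: a 7B-parameter model on 64 GPUs (stage 0 = plain data parallel).
for s in range(4):
    gib = bytes_per_param(s, 64) * 7e9 / 2**30
    print(f"ZeRO stage {s}: {gib:.1f} GiB per GPU (model states only)")
```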
ZeRO++
Megatron
FairScale
Ray Train
TorchTitan
Decentralized Multi-Datacenter / Federated Training / Asynchronous Training
- INTELLECT-1: Launching the First Decentralized Training of a 10B Parameter Model
- Multi-Datacenter Training: OpenAI’s Ambitious Plan To Beat Google’s Infrastructure
- Arthur Douillard on X: “@dylan522p You may want to read into the latest advance of dist/fed learning for LLMs, e.g. DiLoCo, OpenDiLoCo, Flower. You could summarize those as the Branch-Train-Merge you mention, but on steroids. https://t.co/MrRrIOkj8A” / X
- Decentralized Training of Deep Learning Models
- GitHub - PrimeIntellect-ai/prime: prime (previously called ZeroBand) is a framework for efficient, globally distributed training of AI models over the internet.
Async SGD
Links
- Distributed Training Of Deep Learning Models : Part ~ 1
- Self contained example of how pipeline parallel works (AFAB and 1F1B) in 200 LOC · GitHub
- ml-engineering/training/model-parallelism at master · stas00/ml-engineering · GitHub
- LLM-Training-Puzzles/Distributed.ipynb at main · srush/LLM-Training-Puzzles · GitHub
- DeepSpeed
- Torch FSDP
- GitHub - pytorch/torchtitan: A native PyTorch Library for large model training
- Fairscale
- Megatron