TLDR

  • Want large models, large batch sizes, and fast training, so data, model, and optimizer state need to be distributed across devices / nodes
  • Communication overhead and pipeline bubbles become an issue

See ml-engineering/training/model-parallelism for a more detailed breakdown

Parallelism Types

Data Parallel

Replicate the model on each device, split each batch into sub-batches (one per device), and synchronize gradients at each step.

Works great for smaller models that fit on a single device (ex: Vision Models)
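A minimal data-parallel sketch using PyTorch DistributedDataParallel (assumes a `torchrun --nproc_per_node=N` launch and one GPU per process; the model and loss are placeholders):

```python
# Data-parallel sketch; launch with `torchrun --nproc_per_node=N train.py`.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")                 # NCCL for GPU collectives
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = DDP(torch.nn.Linear(1024, 1024).cuda())          # placeholder model; grads are all-reduced
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")             # each rank sees its own sub-batch
        loss = model(x).pow(2).mean()                         # placeholder loss
        loss.backward()                                       # DDP overlaps the all-reduce with backward
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```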

Model Parallel

Tensor Parallel

Split individual weight tensors (e.g. the attention and MLP matrices) into chunks across devices; each device computes a partial result that is combined with collectives.
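A rough sketch of a column-parallel linear layer in this style (illustrative, not the Megatron API; assumes an initialized process group and `out_features` divisible by the world size):

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each rank holds a column shard of the weight; the output is re-assembled with an all-gather."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0, "out_features must divide evenly across ranks"
        self.shard = torch.nn.Linear(in_features, out_features // world, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.shard(x)                                   # (batch, out_features / world)
        parts = [torch.empty_like(local) for _ in range(dist.get_world_size())]
        # Forward-only sketch: a trainable version needs an autograd-aware gather
        # (e.g. torch.distributed.nn.functional.all_gather).
        dist.all_gather(parts, local)
        return torch.cat(parts, dim=-1)                         # full (batch, out_features) output
```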

Pipeline Parallelism

Split model by layer, with each device holding a set of consecutive layers.

Naive pipeline parallelism leaves most devices idle (bubbles), so each batch is split into microbatches that are fed through the pipeline in a staggered schedule, letting all devices work in parallel.
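A toy single-process illustration of GPipe-style microbatching (the stages are placeholder layers; a real pipeline puts each stage on its own device and exchanges activations with point-to-point send / receive):

```python
import torch

# Two placeholder pipeline stages; in a real setup each stage lives on its own device.
stage0 = torch.nn.Linear(512, 512)
stage1 = torch.nn.Linear(512, 512)

batch = torch.randn(64, 512)
microbatches = batch.chunk(8)                    # 8 microbatches instead of 1 big batch

# GPipe-style idea: while stage1 works on microbatch i, stage0 can already start i+1,
# shrinking the bubble. On a single process the loops simply run back-to-back.
hidden  = [stage0(mb) for mb in microbatches]
outputs = [stage1(h) for h in hidden]

loss = torch.cat(outputs).mean()                 # placeholder loss over all microbatches
loss.backward()                                  # real schedules also run backward per microbatch
```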

Sequence Parallelism

For sequence models, split the input along the sequence dimension and process the sub-sequences on different devices.
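A forward-only sketch of the data movement (each rank handles a slice of the sequence; real implementations also have to handle attention's dependencies across the whole sequence):

```python
import torch
import torch.distributed as dist

def sequence_parallel_forward(x: torch.Tensor, block: torch.nn.Module) -> torch.Tensor:
    """x: (batch, seq_len, hidden); seq_len assumed divisible by the world size."""
    world, rank = dist.get_world_size(), dist.get_rank()
    chunk = x.chunk(world, dim=1)[rank]                     # this rank's sub-sequence
    out = block(chunk)                                      # position-wise work (MLPs, norms) stays local
    gathered = [torch.empty_like(out) for _ in range(world)]
    dist.all_gather(gathered, out)                          # forward-only re-assembly of the sequence
    return torch.cat(gathered, dim=1)
```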

Expert Parallelism for MoEs

Put a subset of the experts on each device; see Expert Parallelism (EP)
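A rough sketch of the token exchange behind expert parallelism, with one expert per rank and top-1 routing (it assumes every rank sends the same number of tokens to every expert, which real routers do not guarantee):

```python
import torch
import torch.distributed as dist

def route_to_experts(tokens: torch.Tensor, router_logits: torch.Tensor,
                     local_expert: torch.nn.Module) -> torch.Tensor:
    """tokens: (n_tokens, hidden); one expert per rank; top-1 routing for simplicity."""
    world = dist.get_world_size()
    dest = router_logits.argmax(dim=-1)                       # expert (= rank) chosen per token
    # Bucket tokens by destination rank; equal bucket sizes assumed everywhere.
    send = [tokens[dest == e].contiguous() for e in range(world)]
    recv = [torch.empty_like(s) for s in send]
    dist.all_to_all(recv, send)                               # exchange tokens between ranks
    out = local_expert(torch.cat(recv))                       # run this rank's expert on its tokens
    # A second all-to-all (omitted here) would send the outputs back to the source ranks.
    return out
```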

Combinations

3D Parallelism (Data, Pipeline and Tensor)

Saving Memory

Activation Checkpointing

Gradient Accumulation

CPU Offloading

Quantization, Compression and Mixed Precision

Optimizers

1-bit Adam

Communication

Primitives

AllReduce

all-gather, all-reduce, broadcast, reduce, reduce-scatter as well as point-to-point send and receive
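A quick sketch of these primitives through torch.distributed (assumes an initialized process group; with NCCL the tensors must live on the GPU, gloo works with CPU tensors):

```python
import torch
import torch.distributed as dist

# Assumes the process group is already initialized (e.g. torchrun + init_process_group).
def demo_collectives():
    rank, world = dist.get_rank(), dist.get_world_size()
    t = torch.ones(4) * rank

    dist.all_reduce(t, op=dist.ReduceOp.SUM)          # every rank ends up with the global sum
    dist.broadcast(t, src=0)                          # copy rank 0's tensor to all ranks

    gathered = [torch.empty(4) for _ in range(world)]
    dist.all_gather(gathered, t)                      # every rank receives every rank's tensor
```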

Libraries / Backends

NCCL

MPI

Gloo

Networking

InfiniBand

Fault Tolerance

Distributed Training Frameworks

Torch FSDP (Fully Sharded Data Parallel) and FSDP2

Device Mesh
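A minimal FSDP sketch using the stable FSDP1 wrapper (FSDP2's fully_shard / DeviceMesh API differs across torch versions, so treat this as illustrative):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launch with torchrun; each rank keeps only a shard of the params, grads and optimizer state.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()

model = FSDP(model)                                    # parameters get sharded across ranks
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)   # optimizer state is sharded too

x = torch.randn(8, 4096, device="cuda")
loss = model(x).mean()
loss.backward()          # params are all-gathered per layer, grads reduce-scattered back to shards
opt.step()
```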

DeepSpeed

ZeRO1

ZeRO2

ZeRO3

  1. Most memory is taken up by optimizer state (the Adam moments are kept in full precision)
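Rough per-parameter memory arithmetic for mixed-precision Adam training, which is what the ZeRO stages partition (standard ZeRO-paper accounting; exact numbers depend on the setup):

```python
# Per-parameter bytes for mixed-precision Adam (the usual ZeRO accounting):
fp16_weights  = 2
fp16_grads    = 2
fp32_master   = 4      # full-precision copy of the weights
fp32_moment_1 = 4      # Adam first moment
fp32_moment_2 = 4      # Adam second moment

total = fp16_weights + fp16_grads + fp32_master + fp32_moment_1 + fp32_moment_2
print(total)           # 16 bytes / param  ->  ~16 GB per billion parameters on a single device

# ZeRO partitions these across the data-parallel ranks:
#   ZeRO-1: shards the optimizer state (the 12 fp32 bytes)
#   ZeRO-2: additionally shards the fp16 gradients
#   ZeRO-3: additionally shards the fp16 parameters themselves
```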

ZeRO++

Megatron

FairScale

Ray Train

TorchTitan

Decentralized Multi-Datacenter / Federated Training / Asynchronous Training

Async SGD

Links