- LLM-Training-Puzzles/Distributed.ipynb at main · srush/LLM-Training-Puzzles · GitHub
- DeepSpeed
- Torch FSDP
- GitHub - pytorch/torchtitan: A native PyTorch Library for large model training
- Fairscale
- Megatron
- Data Parallel
- Model Parallel
  - Tensor Parallel: split tensors
  - Pipeline Parallelism: split by layer
  - Expert Parallelism for MoEs
- Activation Checkpointing
- Gradient Accumulation
- CPU Offloading
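Gradient accumulation from the list above can be sketched without any framework: sum the gradients of several micro-batches while the weights stay frozen, then take one optimizer step, which matches a single large-batch step. A minimal pure-Python sketch (the toy model, data, and learning rate are made up for illustration):

```python
# Gradient accumulation: simulate a large batch by summing gradients
# over several micro-batches, then take one optimizer step.
# Toy model (illustrative): scalar regression y ≈ w*x, squared-error
# loss, plain SGD.

def grad(w, x, y):
    # d/dw (w*x - y)^2 = 2 * (w*x - y) * x
    return 2.0 * (w * x - y) * x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
lr = 0.01
accum_steps = 2  # micro-batches per optimizer step

# Accumulated micro-batch training: weights stay frozen while
# gradients from accum_steps micro-batches are summed.
w_accum, g_sum, seen = 0.0, 0.0, 0
for x, y in data:
    g_sum += grad(w_accum, x, y)
    seen += 1
    if seen == accum_steps:
        w_accum -= lr * g_sum / accum_steps  # average -> big-batch step
        g_sum, seen = 0.0, 0

# Reference: the same updates computed directly on full-size batches.
w_big = 0.0
for batch in (data[0:2], data[2:4]):
    g = sum(grad(w_big, x, y) for x, y in batch) / len(batch)
    w_big -= lr * g

assert abs(w_accum - w_big) < 1e-12  # identical trajectories
```

This is why accumulation lets a memory-limited GPU train with an effectively larger batch: only the running gradient sum is extra state, not the extra activations of a bigger batch.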
- How Fully Sharded Data Parallel (FSDP) works? - YouTube
- Lecture 12 (Part2): Maximize GPU Utilization - YouTube
Torch FSDP (Fully Sharded Data Parallel)
Device Mesh
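The device-mesh idea is to lay out the world's ranks in an n-D grid so each rank is addressed by coordinates such as (data-parallel, tensor-parallel), and collectives run along one mesh dimension at a time. A framework-free sketch of a 2-D mesh (the row/column group assignment is illustrative, not PyTorch's API):

```python
# Device mesh sketch: arrange world ranks in a 2-D grid so each rank
# belongs to one tensor-parallel (TP) group and one data-parallel (DP)
# group. Collectives then run along a single mesh dimension.

def build_mesh(world_size, tp_size):
    assert world_size % tp_size == 0
    dp_size = world_size // tp_size
    # mesh[d][t] = global rank at DP coordinate d, TP coordinate t
    mesh = [[d * tp_size + t for t in range(tp_size)]
            for d in range(dp_size)]
    # TP groups: ranks in the same row (shard one layer's tensors)
    tp_groups = mesh
    # DP groups: ranks in the same column (replicas that all-reduce grads)
    dp_groups = [[mesh[d][t] for d in range(dp_size)]
                 for t in range(tp_size)]
    return tp_groups, dp_groups

tp_groups, dp_groups = build_mesh(world_size=8, tp_size=2)
# tp_groups -> [[0, 1], [2, 3], [4, 5], [6, 7]]
# dp_groups -> [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Placing TP groups on adjacent ranks (same row) keeps the chatty tensor-parallel collectives within a node's fast interconnect, while the less frequent DP all-reduce crosses nodes.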
DeepSpeed
ZeRO1: shard optimizer states across data-parallel ranks
ZeRO2: additionally shard gradients
ZeRO3: additionally shard the parameters themselves
- Most memory is taken up by optimizer states (the Adam moments need to be stored in full precision)
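The note above can be quantified with the accounting from the ZeRO paper: mixed-precision Adam needs roughly 16 bytes per parameter (2 B fp16 weights + 2 B fp16 grads + 12 B of fp32 optimizer state: master copy plus two Adam moments), and each ZeRO stage shards one more of those buckets. A small sketch of that arithmetic (the model size and GPU count are made-up examples):

```python
# Per-GPU memory, in bytes per parameter, for mixed-precision Adam
# under the ZeRO stages, following the ZeRO paper's accounting:
#   fp16 params: 2 B, fp16 grads: 2 B,
#   optimizer states: 12 B (fp32 master weights + Adam m and v, 4 B each).

def bytes_per_param(stage, n_gpus):
    params, grads, opt = 2.0, 2.0, 12.0
    if stage >= 1:           # ZeRO-1: shard optimizer states
        opt /= n_gpus
    if stage >= 2:           # ZeRO-2: also shard gradients
        grads /= n_gpus
    if stage >= 3:           # ZeRO-3: also shard parameters
        params /= n_gpus
    return params + grads + opt

# Example: a 7B-parameter model on 8 GPUs.
n_params = 7e9
for stage in range(4):
    gb = bytes_per_param(stage, 8) * n_params / 2**30
    print(f"ZeRO-{stage}: {gb:,.1f} GB per GPU")
```

With no sharding the 16 B/param dominates (far beyond a single GPU's memory for a 7B model); ZeRO-1 already removes most of it because the 12 B of optimizer state is the largest bucket, which is exactly the point of the note above.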