PyTorch
Performance Optimizations
- Performance Tuning Guide — PyTorch Tutorials 2.0.0+cu117 documentation
- Efficient Training on Multiple GPUs
- GitHub - LukasHedegaard/pytorch-benchmark: Easily benchmark PyTorch model FLOPs, latency, throughput, allocated gpu memory and energy consumption
Model Initialization
- Making model initialization faster
- use a context manager with __torch_function__ to skip CPU init
Mixed Precision
- GitHub - TimDettmers/bitsandbytes: 8-bit CUDA functions for PyTorch
- GitHub - Azure/MS-AMP: Microsoft Automatic Mixed Precision Library
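The core built-in mechanism behind these libraries is `torch.autocast`. A minimal sketch, using bfloat16 on CPU so it runs anywhere; on GPU you would typically use `device_type="cuda"` with `dtype=torch.float16` plus `torch.cuda.amp.GradScaler` to rescale the loss and avoid fp16 gradient underflow:

```python
import torch

model = torch.nn.Linear(16, 16)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 16), torch.randn(4, 16)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)                        # matmul runs in bfloat16
    loss = (out.float() - y).pow(2).mean()
loss.backward()                           # gradients land on float32 master params
opt.step()
```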
Quantization
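The lowest-friction entry point here is post-training dynamic quantization: weights are stored as int8 and activations are quantized on the fly, which mainly speeds up Linear/LSTM-heavy CPU inference. A minimal sketch with an illustrative model:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 8),
)
# Convert all Linear layers to their dynamically quantized int8 counterparts
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
out = qmodel(torch.randn(1, 64))
```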
Data Loading
- GitHub - libffcv/ffcv: FFCV: Fast Forward Computer Vision (and other ML workloads!)
- GitHub - mosaicml/streaming: A Data Streaming Library for Efficient Neural Network Training
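Before reaching for the libraries above, the stock `DataLoader` has a few knobs that often remove the input bottleneck on their own. A sketch with illustrative values:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))
loader = DataLoader(
    ds,
    batch_size=32,
    num_workers=2,            # CPU-side loading/augmentation in parallel processes
    pin_memory=True,          # page-locked host memory -> faster host-to-device copies
    persistent_workers=True,  # keep worker processes alive across epochs
    prefetch_factor=2,        # batches each worker prepares ahead of time
)
xb, yb = next(iter(loader))
```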
torch.compile
- GitHub - pytorch-labs/segment-anything-fast: A batched offline inference oriented version of segment-anything
- GitHub - pytorch-labs/gpt-fast: Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
JIT
TorchDynamo
- https://github.com/pytorch/torchdynamo
- example: https://github.com/pytorch/torchdynamo/blob/main/benchmarks/training_loss.py
TorchInductor
AITemplate
- https://github.com/facebookincubator/AITemplate
Distributed
- https://lambdalabs.com/blog/multi-node-pytorch-distributed-training-guide/
- Getting Started with Fully Sharded Data Parallel (FSDP) — PyTorch Tutorials 2.1.1+cu121 documentation
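The process-group setup those guides describe can be shown in a minimal single-process run (gloo backend on CPU, DDP for portability of the sketch); a real job launches one process per GPU, e.g. via torchrun, and FSDP additionally shards parameters and optimizer state across ranks:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous info normally supplied by the launcher (torchrun)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(16, 16))
loss = model(torch.randn(4, 16)).sum()
loss.backward()  # DDP all-reduces gradients across ranks during backward
dist.destroy_process_group()
```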