PyTorch
Performance Optimizations
- Performance Tuning Guide — PyTorch Tutorials 2.0.0+cu117 documentation
- Efficient Training on Multiple GPUs
- GitHub - LukasHedegaard/pytorch-benchmark: Easily benchmark PyTorch model FLOPs, latency, throughput, allocated gpu memory and energy consumption
Model Initialization
- Making model initialization faster
- use a context manager with __torch_function__ to skip CPU init
Mixed Precision
- GitHub - TimDettmers/bitsandbytes: 8-bit CUDA functions for PyTorch
- GitHub - Azure/MS-AMP: Microsoft Automatic Mixed Precision Library
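The core built-in mechanism behind these libraries is `torch.autocast`. A minimal sketch, using bfloat16 on CPU so it runs anywhere; on GPU you would typically use `device_type="cuda"` with `dtype=torch.float16` plus `torch.cuda.amp.GradScaler` to rescale the loss and avoid fp16 gradient underflow:

```python
import torch

model = torch.nn.Linear(16, 16)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 16), torch.randn(4, 16)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)                        # matmul runs in bfloat16
    loss = (out.float() - y).pow(2).mean()
loss.backward()                           # gradients land on float32 master params
opt.step()
```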
Quantization
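The lowest-friction entry point here is post-training dynamic quantization: weights are stored as int8 and activations are quantized on the fly, which mainly speeds up Linear/LSTM-heavy CPU inference. A minimal sketch with an illustrative model:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 8),
)
# Convert all Linear layers to their dynamically quantized int8 counterparts
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
out = qmodel(torch.randn(1, 64))
```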
Data Loading
- GitHub - libffcv/ffcv: FFCV: Fast Forward Computer Vision (and other ML workloads!)
- GitHub - mosaicml/streaming: A Data Streaming Library for Efficient Neural Network Training
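Before reaching for the libraries above, the stock `DataLoader` has a few knobs that often remove the input bottleneck on their own. A sketch with illustrative values:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))
loader = DataLoader(
    ds,
    batch_size=32,
    num_workers=2,            # CPU-side loading/augmentation in parallel processes
    pin_memory=True,          # page-locked host memory -> faster host-to-device copies
    persistent_workers=True,  # keep worker processes alive across epochs
    prefetch_factor=2,        # batches each worker prepares ahead of time
)
xb, yb = next(iter(loader))
```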
torch.compile
- GitHub - pytorch-labs/segment-anything-fast: A batched offline inference oriented version of segment-anything
- GitHub - pytorch-labs/gpt-fast: Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
JIT
TorchDynamo
- https://github.com/pytorch/torchdynamo
- example: https://github.com/pytorch/torchdynamo/blob/main/benchmarks/training_loss.py
TorchInductor
AITemplate
- https://github.com/facebookincubator/AITemplate
Distributed
- https://lambdalabs.com/blog/multi-node-pytorch-distributed-training-guide/
- Getting Started with Fully Sharded Data Parallel (FSDP) — PyTorch Tutorials 2.1.1+cu121 documentation
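The process-group setup those guides describe can be shown in a minimal single-process run (gloo backend on CPU, DDP for portability of the sketch); a real job launches one process per GPU, e.g. via torchrun, and FSDP additionally shards parameters and optimizer state across ranks:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous info normally supplied by the launcher (torchrun)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(16, 16))
loss = model(torch.randn(4, 16)).sum()
loss.backward()  # DDP all-reduces gradients across ranks during backward
dist.destroy_process_group()
```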