Optimizers
SGD + Momentum
Adam
AdamW
LAMB
Lion
- automl/lion at master · google/automl · GitHub
- GitHub - lucidrains/lion-pytorch: 🦁 Lion, new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(w), in Pytorch
- Lion 8 bit by lucidrains · Pull Request #188 · TimDettmers/bitsandbytes · GitHub
Adam-mini
ADOPT
MARS
Cautious Optimizer
- GitHub - kyleliang919/C-Optim: When it comes to optimizers, it’s always better to be safe than sorry
Kron
Muon
- fsdp_optimizers/soap.py at main · ethansmith2000/fsdp_optimizers · GitHub
- Muon: An optimizer for hidden layers in neural networks | Keller Jordan blog
SOAP
PSGD
Schedule Free
- Aaron Defazio Talk (12.06.2024, UCLA) - YouTube
- Linear Decay works better than Cosine when combined with warmup
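  A minimal PyTorch sketch of warmup followed by linear decay (the model, step counts, and peak LR below are illustrative placeholders, not values from the talk):
  ```python
  import torch
  from torch.optim.lr_scheduler import LinearLR, SequentialLR

  model = torch.nn.Linear(512, 512)  # placeholder model
  opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

  warmup_steps, total_steps = 1_000, 100_000

  # linear warmup from ~0 up to the peak LR, then linear decay back down to 0
  warmup = LinearLR(opt, start_factor=1e-3, end_factor=1.0, total_iters=warmup_steps)
  decay = LinearLR(opt, start_factor=1.0, end_factor=0.0, total_iters=total_steps - warmup_steps)
  sched = SequentialLR(opt, schedulers=[warmup, decay], milestones=[warmup_steps])

  for step in range(total_steps):
      # ... forward, backward, opt.step(), opt.zero_grad() ...
      sched.step()  # called once per optimizer step, not per epoch
  ```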
Shampoo
AdEMAMix
LaProp
- HeavyBall/heavyball/foreach_laprop.py at main · ClashLuke/HeavyBall · GitHub
APOLLO: SGD-like Memory, AdamW-level Performance
- [2412.05270] APOLLO: SGD-like Memory, AdamW-level Performance
- GitHub - zhuhanqing/APOLLO: APOLLO: SGD-like Memory, AdamW-level Performance
SAM - Sharpness-Aware Minimization
Learned Optimizers
Warmup
Schedules
Regularization
Weight Decay
Gradient Clipping
EMA - Exponential Moving Averages of Parameters
Batch Size vs Learning Rate
Distributed
Hyperparameter Optimization
- Optuna: A hyperparameter optimization framework (see the sketch below this list)
- Tune: Scalable Hyperparameter Tuning — Ray 1.13.0
- GitHub - facebookresearch/nevergrad: A Python toolbox for performing gradient-free optimization
- GitHub - facebook/Ax: Adaptive Experimentation Platform
- GitHub - pytorch/botorch: Bayesian optimization in PyTorch
- GitHub - google/vizier: Python-based research interface for blackbox and hyperparameter optimization, based on the internal Google Vizier Service.
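  A minimal Optuna sketch, with a dummy objective standing in for a real training/validation run (the search space and the `train_and_eval` stub are illustrative assumptions):
  ```python
  import optuna

  def train_and_eval(lr, weight_decay):
      # stand-in for a real training + validation run; returns a fake loss
      return (lr - 3e-4) ** 2 + weight_decay

  def objective(trial):
      # hypothetical search space: learning rate and weight decay for AdamW
      lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
      wd = trial.suggest_float("weight_decay", 1e-4, 1e-1, log=True)
      return train_and_eval(lr, wd)

  study = optuna.create_study(direction="minimize")
  study.optimize(objective, n_trials=50)
  print(study.best_params)
  ```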
muP
Evolutionary
Tips
- For small batch sizes - AdamW with a small weight decay
- For large batch sizes (in the 1000s) - LAMB. LAMB is roughly AdamW + warmup + cosine decay rolled into one, and all you need to pick is a learning rate (3e-4 :); see the sketch below
https://twitter.com/sgondala2/status/1555677621748346880
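  A hedged sketch of the small-batch recommendation; the values are illustrative, and the LAMB part stays a comment because LAMB is not in core PyTorch:
  ```python
  import torch

  model = torch.nn.Linear(512, 512)  # placeholder model

  # small-batch regime: AdamW with a small weight decay (illustrative values)
  opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

  # large-batch regime: LAMB is not part of core PyTorch; an external implementation
  # (e.g. torch_optimizer.Lamb or NVIDIA Apex's FusedLAMB) would be dropped in here
  # instead of AdamW.
  ```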
LR warmup was one of the things looked at in https://arxiv.org/abs/2110.04369; the idea is that it lets you use a higher peak LR, which lets you train faster overall. So when ablating, I'd also try adding a warmup and increasing your max LR.
https://twitter.com/zacharynado/status/1555920966982803457
…I have done extensive experiments with schedules like cosine vs. linear vs. staircase decays. If each one is tuned correctly, they always end up giving the same results, so I think it's not worth spending time on that.
https://twitter.com/giffmana/status/1555818856756793344
Just in case: warmup typically seems not to be needed at small batch sizes or with small models, so you may not notice its effect in “mini” experiments.
https://twitter.com/giffmana/status/1555819323586928640
I think it’s (warmup) mostly to do with the starting betas of Adam being poorly chosen. The linear warmup gives the exponential moving averages some time to reach the correct values before learning starts in earnest.
https://twitter.com/pbloemesquire/status/1555834823578632195
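  One back-of-the-envelope reading of that point (a heuristic of mine, not something claimed in the tweet): Adam's moving averages need roughly 1/(1-beta) steps to reflect real gradient statistics, so a warmup of similar length gives them time to settle.
  ```python
  # PyTorch Adam defaults
  beta1, beta2 = 0.9, 0.999

  # effective averaging horizons of the first- and second-moment EMAs
  horizon_m = 1 / (1 - beta1)   # ~10 steps
  horizon_v = 1 / (1 - beta2)   # ~1000 steps

  # heuristic: warm up for on the order of the slower EMA's horizon
  warmup_steps = round(horizon_v)
  print(horizon_m, horizon_v, warmup_steps)
  ```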
My experience, gained from fine-tuning with small data, is that the LR schedule dramatically impacts the final accuracy. And yes, some form of warmup helps a lot.
https://twitter.com/JFPuget/status/1555836240968093696
Also, hard gradient clipping by value at ±3 to 5 is very useful. I’m generally a fan of aggressive learning rates/schedules plus aggressive clipping settings, with long decay schedules and adaptive gradient methods (e.g. the Adam family) for my model training.
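  A minimal PyTorch sketch of hard clipping by value (the model and clip value are placeholders; `clip_grad_value_` is the relevant core utility):
  ```python
  import torch
  from torch.nn.utils import clip_grad_value_

  model = torch.nn.Linear(128, 128)  # placeholder model
  opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

  # one illustrative training step
  loss = model(torch.randn(8, 128)).sum()
  loss.backward()
  clip_grad_value_(model.parameters(), clip_value=3.0)  # clamp each gradient element to [-3, 3]
  opt.step()
  opt.zero_grad()
  ```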