michal.i/o

❯

❯

❯

Optimization

Jan 21, 20254 min read

optim

Optimizers

SGD + Momentum

Adam

AdamW

LAMB

Lion

automl/lion at master · google/automl · GitHub
GitHub - lucidrains/lion-pytorch: 🦁 Lion, new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(w), in Pytorch
- [WIP] Testing the lion optimizer by mitchellnw · Pull Request #432 · mlfoundations/open_clip · GitHub
- discuss whether it worked or didn’t work · lucidrains/lion-pytorch · Discussion #1 · GitHub
Lion 8 bit by lucidrains · Pull Request #188 · TimDettmers/bitsandbytes · GitHub

Adam-mini

GitHub - zyushun/Adam-mini: Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793

ADOPT

GitHub - iShohei220/adopt: Official Implementation of “ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate”

MARS

GitHub - AGI-Arena/MARS: The official implementation of MARS: Unleashing the Power of Variance Reduction for Training Large Models

Cautious Optimizer

GitHub - kyleliang919/C-Optim: When it comes to optimizers, it’s always better to be safe than sorry

Kron

fsdp_optimizers/kron_mars.py at main · ethansmith2000/fsdp_optimizers · GitHub

Muon

fsdp_optimizers/soap.py at main · ethansmith2000/fsdp_optimizers · GitHub
Muon: An optimizer for hidden layers in neural networks | Keller Jordan blog

SOAP

fsdp_optimizers/soap.py at main · ethansmith2000/fsdp_optimizers · GitHub

PSGD

GitHub - lixilinx/psgd_torch: Pytorch implementation of preconditioned stochastic gradient descent (Kron and affine preconditioner, low-rank approximation preconditioner and more)
HeavyBall/heavyball at main · ClashLuke/HeavyBall · GitHub

Schedule Free

Aaron Defazio Talk (12.06.2024, UCLA) - YouTube
- Linear Decay works better than Cosine when combined with warmup

Shampoo

AdEMAMix

GitHub - nanowell/AdEMAMix-Optimizer-Pytorch: The AdEMAMix Optimizer: Better, Faster, Older.

Laprop

HeavyBall/heavyball/foreach_laprop.py at main · ClashLuke/HeavyBall · GitHub

APOLLO: SGD-like Memory, AdamW-level Performance

[2412.05270] APOLLO: SGD-like Memory, AdamW-level Performance
GitHub - zhuhanqing/APOLLO: APOLLO: SGD-like Memory, AdamW-level Performance

SAM - Sharpness-Aware Minimization

[2010.01412] Sharpness-Aware Minimization for Efficiently Improving Generalization

Learned Optimizers

[2406.00153] $μ$ LO: Compute-Efficient Meta-Generalization of Learned Optimizers
- GitHub - bentherien/mu_learned_optimization: [Oral; Neurips OPT2024 ] μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

Warmup

Schedules

Regularization

Weight Decay

Gradient Clipping

EMA - Exponential Moving Averages of Parameters

Batch Size vs Learning Rate

Distributed

distributed_shampoo

GitHub - nikhilvyas/SOAP
[2409.03137] The AdEMAMix Optimizer: Better, Faster, Older

Hyperparameter Optimization

Optuna: A hyperparameter optimization framework
Tune: Scalable Hyperparameter Tuning — Ray 1.13.0
GitHub - facebookresearch/nevergrad: A Python toolbox for performing gradient-free optimization
GitHub - Facebook/Ax: Adaptive Experimentation Platform
GitHub - pytorch/botorch: Bayesian optimization in PyTorch
GitHub - google/vizier: Python-based research interface for blackbox and hyperparameter optimization, based on the internal Google Vizier Service.

muP

The Practitioner’s Guide to the Maximal Update Parameterization | EleutherAI Blog

Evolutionary

Tips

For small batch sizes - AdamW with small decay

Large batch sizes (say 1000s) - LAMB. LAMB is ~ AdamW + warmup + cosine decay rolled into one and all you need to decide on is a learning rate (3e-4 :)

https://twitter.com/sgondala2/status/1555677621748346880

LR warmup was one of the things looked at in https://arxiv.org/abs/2110.04369, the idea being it helps you use a higher peak LR which lets you overall train faster. so when ablating I’d also try adding a warmup and also increasing your max LR

https://twitter.com/zacharynado/status/1555920966982803457

…I have done extensive experiments of schedules like cos vs linear vs staircase etc decays. If each one is tuned correctly, they always end up giving the same results. So I think it’s not worth spending time on that.

https://twitter.com/giffmana/status/1555818856756793344

Just in case: warm-up seems to typically not be needed at small batch or with small models, so you may not notice its effect in “mini” experiments.

https://twitter.com/giffmana/status/1555819323586928640

I think it’s (warmup) mostly to do with the starting beta’s of Adam being poorly chosen. The linear warmup gives the exponential moving averages some time to get the correct values before learning starts in earnest.

https://twitter.com/pbloemesquire/status/1555834823578632195

My experience, gained from fine tuning with small data, is that lr schedule dramatically impacts the end result accuracy. And yes, some form of warmup helps a lot.

https://twitter.com/JFPuget/status/1555836240968093696

Also, gradient clipping hard clip by value at +-3 to 5 is very useful. I’m generally a fan of aggressive learning rates / schedules + aggressive clipping settings, with some long decay schedules and adaptive gradient methods (e.g. Adam family) for my model training.

https://twitter.com/kastnerkyle/status/1555565583294496769

Backlinks

No backlinks found

Graph View

Created with Quartz v4.4.0 © 2025