optim

Optimizers

SGD + Momentum

Adam

AdamW

LAMB

Lion

Adam-mini

ADOPT

MARS

Cautious Optimizer

Kron

Muon

SOAP

PSGD

Schedule Free

Shampoo

AdEMAMix

Laprop

LaProp implementation in HeavyBall: ClashLuke/HeavyBall, heavyball/foreach_laprop.py (GitHub)

APOLLO: SGD-like Memory, AdamW-level Performance

SAM - Sharpness-Aware Minimization

Learned Optimizers

Warmup

Schedules

Regularization

Weight Decay

Gradient Clipping

EMA - Exponential Moving Averages of Parameters

Batch Size vs Learning Rate

Distributed

distributed_shampoo

Hyperparameter Optimization

muP


Evolutionary

EvoTorch

Tips

  1. For small batch sizes - AdamW with a small weight decay.
  2. For large batch sizes (in the 1000s) - LAMB. LAMB is roughly AdamW + warmup + cosine decay rolled into one, and all you need to decide on is a learning rate (3e-4 :)

https://twitter.com/sgondala2/status/1555677621748346880
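A minimal sketch of the two setups in PyTorch, assuming a placeholder model; LAMB is not in torch.optim, so the large-batch branch is only indicated via the third-party torch_optimizer package.

```python
import torch

model = torch.nn.Linear(128, 10)  # placeholder model

# 1. Small batch sizes: AdamW with a small weight decay.
opt_small_batch = torch.optim.AdamW(
    model.parameters(), lr=3e-4, weight_decay=0.01
)

# 2. Large batch sizes (1000s): LAMB. Not part of torch.optim; the lines below
#    assume the third-party `torch_optimizer` package, which ships a Lamb class.
# import torch_optimizer
# opt_large_batch = torch_optimizer.Lamb(model.parameters(), lr=3e-4)
```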

LR warmup was one of the things looked at in https://arxiv.org/abs/2110.04369; the idea is that it lets you use a higher peak LR, which lets you train faster overall. So when ablating, I’d also try adding a warmup and increasing your max LR.

https://twitter.com/zacharynado/status/1555920966982803457
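A sketch of that ablation, assuming made-up step counts and peak LR: linear warmup to the peak, then cosine decay, built with a plain LambdaLR.

```python
import math
import torch

model = torch.nn.Linear(128, 10)                           # placeholder
peak_lr, warmup_steps, total_steps = 1e-3, 1_000, 100_000  # made-up values

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

def warmup_then_cosine(step):
    # Linear warmup from 0 to peak_lr, then cosine decay back towards 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_cosine)

# In the training loop, call optimizer.step() and then scheduler.step() each step.
```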

…I have done extensive experiments with schedules like cosine vs. linear vs. staircase decays, etc. If each one is tuned correctly, they always end up giving the same results. So I think it’s not worth spending time on that.

https://twitter.com/giffmana/status/1555818856756793344
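For reference, the three schedule families being compared map onto standard PyTorch schedulers roughly as below (step counts and decay factors are arbitrary placeholders; in a real run you would attach exactly one of them to an optimizer).

```python
import torch

model = torch.nn.Linear(128, 10)                        # placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
total_steps = 10_000                                    # placeholder

# Cosine decay over the whole run.
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

# Linear decay from the initial LR down to 1% of it.
linear = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.01, total_iters=total_steps
)

# Staircase decay: multiply the LR by 0.1 every few thousand steps.
staircase = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3_000, gamma=0.1)
```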

Just in case: warmup typically doesn’t seem to be needed at small batch sizes or with small models, so you may not notice its effect in “mini” experiments.

https://twitter.com/giffmana/status/1555819323586928640

I think warmup mostly has to do with the starting betas of Adam being poorly chosen. The linear warmup gives the exponential moving averages some time to reach the correct values before learning starts in earnest.

https://twitter.com/pbloemesquire/status/1555834823578632195
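To make that concrete: with the default betas (0.9, 0.999), Adam’s second-moment EMA averages over a window of roughly 1/(1 - beta2) = 1000 steps, so its early estimates rest on very few gradients. A small illustration (mine, not from the thread) of the bias-correction terms over the first steps:

```python
beta1, beta2 = 0.9, 0.999  # Adam defaults

for t in (1, 10, 100, 1000):
    # Bias-correction denominators from the Adam paper: m_hat = m / (1 - beta1**t), etc.
    print(f"step {t:5d}: 1 - beta1^t = {1 - beta1**t:.4f}, 1 - beta2^t = {1 - beta2**t:.4f}")

# At step 10 the second-moment estimate has seen ~1% of its eventual window,
# which is one intuition for why a linear LR warmup over the first ~1k steps helps.
```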

My experience, gained from fine-tuning with small data, is that the LR schedule dramatically impacts the final accuracy. And yes, some form of warmup helps a lot.

https://twitter.com/JFPuget/status/1555836240968093696

Also, hard gradient clipping by value at ±3 to 5 is very useful. I’m generally a fan of aggressive learning rates/schedules plus aggressive clipping settings, with long decay schedules and adaptive gradient methods (e.g. the Adam family) for my model training.

https://twitter.com/kastnerkyle/status/1555565583294496769
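A sketch of that kind of setup, assuming a placeholder model and loss: hard value clipping at ±3 right before the optimizer step.

```python
import torch

model = torch.nn.Linear(128, 10)                           # placeholder
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def train_step(batch, targets):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch), targets)
    loss.backward()
    # Hard clip every gradient element to [-3, 3] before stepping.
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=3.0)
    optimizer.step()
    return loss.item()
```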