SGD + Momentum
Batch Size vs Learning Rate
- Optuna: A hyperparameter optimization framework
- Ray Tune: Scalable Hyperparameter Tuning (Ray 1.13.0)
- facebookresearch/nevergrad: A Python toolbox for performing gradient-free optimization
- facebook/Ax: Adaptive Experimentation Platform
- pytorch/botorch: Bayesian optimization in PyTorch
- For small batch sizes: AdamW with a small weight decay.
- For large batch sizes (say, in the 1000s): LAMB. LAMB is roughly AdamW + warmup + cosine decay rolled into one, so all you need to decide on is a learning rate (3e-4 :)
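To make the "AdamW with small decay" recommendation concrete, here is a minimal sketch of one AdamW update for a single scalar parameter. It assumes "decay" means weight decay. The function name `adamw_step` and the default of 0.01 for `weight_decay` are illustrative choices, not values from these notes; only the 3e-4 learning rate is. Real training code would use a library optimizer rather than this.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Decoupled weight decay: applied to the parameter directly,
    # not folded into the gradient.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# Hypothetical usage: a few steps on a parameter with a constant gradient.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adamw_step(theta, grad=2.0, m=m, v=v, t=t)
```

With a zero gradient the parameter still shrinks by a factor of `1 - lr * weight_decay` per step. That is the "decay" part, and it is why a small value is usually enough.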
LR warmup was one of the things looked at in https://arxiv.org/abs/2110.04369; the idea is that it lets you use a higher peak LR, which lets you train faster overall. So when ablating, I'd also try adding a warmup and increasing your max LR.
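A minimal sketch of what "adding a warmup" could look like: linear warmup to a peak LR, followed by cosine decay. The function name `lr_at` and the default step counts are assumptions made for illustration. They are not taken from the linked paper.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=1000, total_steps=10000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For the ablation suggested above, one would compare runs with `warmup_steps=0` against runs with a nonzero warmup, and separately raise `peak_lr` to see whether the warmup makes a higher peak stable.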
...I have done extensive experiments comparing decay schedules such as cosine vs. linear vs. staircase. If each one is tuned correctly, they always end up giving the same results, so I don't think it's worth spending time on that.
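For reference, the three decay shapes mentioned in that comment can be sketched as multipliers on the peak LR, each a function of training progress in [0, 1]. The function names, `num_stairs`, and `factor` are hypothetical, chosen only to illustrate the shapes.

```python
import math

def cosine_decay(progress):
    # Smooth decay from 1.0 at progress=0 to 0.0 at progress=1.
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def linear_decay(progress):
    # Straight line from 1.0 down to 0.0.
    return 1.0 - progress

def staircase_decay(progress, num_stairs=4, factor=0.5):
    # Multiply by `factor` each time another 1/num_stairs of training has passed.
    stairs_passed = min(num_stairs - 1, int(progress * num_stairs))
    return factor ** stairs_passed
```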
Just in case: warmup typically does not seem to be needed at small batch sizes or with small models, so you may not notice its effect in "mini" experiments.
I think warmup is mostly to do with the starting betas of Adam being poorly chosen. The linear warmup gives the exponential moving averages some time to reach the correct values before learning starts in earnest.
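The point about the moving averages needing time can be illustrated with a small sketch: a zero-initialised exponential moving average fed a constant input. The helper name `ema_of_constant` is made up for this example. The value beta = 0.999 is the commonly used default for Adam's second-moment average.

```python
def ema_of_constant(value, beta, steps):
    """Run a zero-initialised EMA on a constant input and return the estimate."""
    ema = 0.0
    for _ in range(steps):
        ema = beta * ema + (1.0 - beta) * value
    return ema

for steps in (1, 10, 100, 1000, 5000):
    estimate = ema_of_constant(1.0, beta=0.999, steps=steps)
    print(f"after {steps:>4} steps: EMA = {estimate:.4f} (true value 1.0)")
```

After `steps` updates the estimate equals `value * (1 - beta ** steps)`, so with beta = 0.999 it takes on the order of a thousand steps to get close to the true value. Adam's bias correction rescales this, but early on the rescaled estimate is still built from only a handful of gradient samples, which is consistent with the explanation given above.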
My experience, gained from fine-tuning with small data, is that the LR schedule dramatically impacts the final accuracy. And yes, some form of warmup helps a lot.
Also, hard gradient clipping by value at ±3 to ±5 is very useful. I'm generally a fan of aggressive learning rates / schedules + aggressive clipping settings, with some long decay schedules and adaptive gradient methods (e.g. the Adam family) for my model training.
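A minimal sketch of hard clipping by value, using a plain list of numbers as a stand-in for gradient entries. The helper name `clip_by_value` and the sample gradients are made up for illustration. In PyTorch the same operation is available in place as `torch.nn.utils.clip_grad_value_`.

```python
def clip_by_value(grads, clip_value=3.0):
    """Clamp every gradient entry to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]

grads = [0.5, -7.2, 3.0, 12.0]
clipped = clip_by_value(grads, clip_value=3.0)
```

Unlike clipping by norm, this treats each entry independently, so it can change the direction of the gradient vector as well as its magnitude.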