Pretraining - Large Scale Training Tricks
Data and Datasets
Balancing
Curriculum Learning
Annealing on High Quality Data
Stability
- [2410.16682v1] Methods of improving LLM training stability
- “input activations and outputs of linear layers of a diverging model have much higher L2 norms in comparison to a converging one”
- “QK norm cap, learning rate can be increased by 1.5x (without model divergence) in comparison to a QK norm”
QK Norm
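A minimal PyTorch-style sketch of QK layer norm (module and argument names are illustrative): queries and keys are normalized per head before the dot product, which bounds the attention logits and the activation norms flagged above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Normalizing Q and K per head keeps their L2 norms bounded,
        # so attention logits cannot grow unchecked during training.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = self.q_norm(q.view(b, t, self.n_heads, self.head_dim))
        k = self.k_norm(k.view(b, t, self.n_heads, self.head_dim))
        v = v.view(b, t, self.n_heads, self.head_dim)
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, d))
```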
Softcapping
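Softcapping squashes logits through a scaled tanh so they stay within a fixed range while remaining differentiable. A minimal sketch; the cap values in the usage comments are illustrative (Gemma 2, for instance, caps attention and final logits this way):

```python
import torch

def softcap(x: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly bounds values to (-cap, cap): approximately the identity for
    # small |x|, saturating at +/-cap for large |x|.
    return cap * torch.tanh(x / cap)

# Illustrative usage:
# attn_logits  = softcap(q @ k.transpose(-2, -1) * head_dim ** -0.5, cap=50.0)
# final_logits = softcap(lm_head(hidden_states), cap=30.0)
```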
Optimization
Optimizers
Learning Schedules
WSD - Warmup, Stable, Decay
- the constant-LR (stable) phase can be extended to keep pretraining on more data; only the short decay phase is rerun to produce a final checkpoint (see the sketch below)
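A minimal sketch of the schedule, assuming step-based training; argument names and the cosine decay shape are illustrative:

```python
import math

def wsd_lr(step: int, max_lr: float, warmup_steps: int,
           total_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    if step < warmup_steps:
        # Warmup: linear ramp from 0 to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < total_steps - decay_steps:
        # Stable: constant LR. Checkpoints from this phase can be trained
        # further on more data without restarting the schedule.
        return max_lr
    # Decay: drop to min_lr over the final decay_steps (cosine here;
    # linear or 1-sqrt shapes are also used).
    frac = min((step - (total_steps - decay_steps)) / decay_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * frac))
```

With PyTorch this can be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to `max_lr` instead of the absolute value.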
Hyperparameter Tuning
- muP (maximal update parametrization): tune hyperparameters on a small proxy model and transfer them to the full-width model
- max learning rate: typically the main hyperparameter swept on the proxy (see the sketch below)
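A heavily simplified sketch of the muP learning-rate transfer rule for Adam-style optimizers: hidden matrix-like weights get their LR scaled by 1/width relative to the proxy model. The name-based filtering (`embed`, `lm_head`) and the function name are assumptions for illustration; the full recipe (e.g. the `mup` library) also rescales initializations and the output layer.

```python
import torch

def mup_style_param_groups(model: torch.nn.Module, base_lr: float,
                           base_width: int, width: int):
    # Core muP rule for Adam: the LR of hidden matrix-like weights scales as
    # 1/width relative to the small proxy model, so the max LR swept on the
    # proxy transfers to the large model. Embeddings, output head, biases and
    # norm parameters keep the base LR in this simplified sketch.
    hidden, other = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name and "lm_head" not in name:
            hidden.append(p)
        else:
            other.append(p)
    return [
        {"params": hidden, "lr": base_lr * base_width / width},
        {"params": other, "lr": base_lr},
    ]

# optimizer = torch.optim.AdamW(
#     mup_style_param_groups(model, base_lr=1e-2, base_width=256, width=4096))
```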
Distributed Training
Monitoring and Logging
Things to Track
- Activation, Gradient and Weight stats (min, max, L2 Norms)
- Eval stats on subsets of the data (sources)
- Losses for early and late tokens
- Top-loss samples (individual examples with the highest loss); see the sketch below
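A minimal PyTorch sketch of this tracking (function names and the flat stats dict are illustrative; in practice the values go to a logger such as wandb or TensorBoard):

```python
import torch

@torch.no_grad()
def weight_and_grad_stats(model: torch.nn.Module) -> dict:
    # Per-parameter min / max / L2 norm for weights and gradients; spikes in
    # these curves are an early warning sign of divergence.
    stats = {}
    for name, p in model.named_parameters():
        stats[f"{name}/weight_l2"] = p.norm(2).item()
        stats[f"{name}/weight_min"] = p.min().item()
        stats[f"{name}/weight_max"] = p.max().item()
        if p.grad is not None:
            stats[f"{name}/grad_l2"] = p.grad.norm(2).item()
            stats[f"{name}/grad_absmax"] = p.grad.abs().max().item()
    return stats

def register_activation_norm_hooks(model: torch.nn.Module, stats: dict):
    # Forward hooks on linear layers record input/output L2 norms, matching
    # the observation above that these blow up in diverging runs.
    def make_hook(name):
        def hook(module, inputs, output):
            stats[f"{name}/in_l2"] = inputs[0].detach().norm(2).item()
            stats[f"{name}/out_l2"] = output.detach().norm(2).item()
        return hook
    return [m.register_forward_hook(make_hook(n))
            for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

def position_bucket_losses(per_token_loss: torch.Tensor, n: int = 64) -> dict:
    # per_token_loss: [batch, seq_len] unreduced cross-entropy; compares loss
    # on early vs. late positions in the context window.
    return {"loss/early_tokens": per_token_loss[:, :n].mean().item(),
            "loss/late_tokens": per_token_loss[:, -n:].mean().item()}
```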
Issues
Reports
- Meta LLAMA logs
- BLOOM
- LLM360
- Allen Institute