Pretraining - Large Scale Training Tricks

January 29, 2025

Data and Datasets

Balancing

Curriculum Learning

Annealing on High Quality Data

Stability

[2410.16682v1] Methods of improving LLM training stability
- “input activations and outputs of linear layers of a diverging model have much higher L2 norms in comparison to a converging one”
- Combining QK layer normalization with softmax capping (“QK norm cap”) allows the learning rate to be increased by 1.5x without model divergence, compared with QK layer normalization alone.

QK Norm
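
A minimal sketch of the idea, assuming QK Norm means normalizing the query and key vectors before their dot product so that the attention logit stays bounded regardless of activation scale. The helper names `layer_norm` and `attention_logit` are hypothetical, the normalization has no learned gain or bias, and it works on one query/key pair in plain Python rather than on batched tensors as a real model would.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product logit for one query/key pair of a single head."""
    if qk_norm:
        # Assumption: both q and k are normalized over the head dimension.
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Illustrative values only: large activations inflate the raw logit.
q = [100.0, -50.0, 25.0, 10.0]
k = [80.0, -60.0, 30.0, 5.0]
raw_logit = attention_logit(q, k, qk_norm=False)
normed_logit = attention_logit(q, k, qk_norm=True)
```

With normalized inputs the logit magnitude cannot exceed the square root of the head dimension, whereas the raw logit grows with the activation norms.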

Softcapping
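
A minimal sketch, assuming the tanh-based form of soft-capping in which logits are squashed into a fixed range before the softmax. The function names and the cap value of 50.0 are illustrative assumptions, not values taken from these notes.

```python
import math

def softcap(logit, cap=50.0):
    """Smoothly squash a logit into [-cap, cap]; close to identity when |logit| << cap."""
    return cap * math.tanh(logit / cap)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits: one extreme outlier among ordinary values.
logits = [5900.0, 12.0, -3.0, 0.5]
capped = [softcap(x) for x in logits]
probs = softmax(capped)
```

Small logits pass through almost unchanged, while the outlier is limited to the cap, so a single exploding logit can no longer dominate by an unbounded margin.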

Optimization

Optimizers

Learning Schedules

WSD - Warmup, Stable, Decay

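Allows you to continue pretraining: a checkpoint from the stable phase can be resumed at the same constant learning rate, with the decay applied only at the very end.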
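
The three phases can be sketched as a simple schedule function. This is a hedged illustration: the name `wsd_lr`, its parameters, and the linear shapes of the warmup and decay are assumptions; other decay shapes are also used in practice.

```python
def wsd_lr(step, max_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` for a Warmup-Stable-Decay schedule."""
    if step < warmup_steps:
        # Warmup: linear ramp up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable: constant learning rate; checkpoints here can be resumed as-is.
        return max_lr
    # Decay: linear ramp from max_lr down to min_lr, then held at min_lr.
    progress = min(step - warmup_steps - stable_steps, decay_steps) / decay_steps
    return max_lr + (min_lr - max_lr) * progress

# Illustrative run: 10 warmup steps, 100 stable steps, 20 decay steps.
schedule = [wsd_lr(s, 1.0, 10, 100, 20) for s in range(130)]
```

To continue pretraining, one would resume from a stable-phase checkpoint and call the same function with a larger `stable_steps`; nothing earlier in the schedule changes, which is not the case for a cosine schedule tied to a fixed total step count.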

Hyperparameter Tuning

muP
Max learning rate: tuned on a small proxy model and transferred to wider models

Distributed Training


Monitoring and Logging

Things to Track

Activation, Gradient and Weight stats (min, max, L2 Norms)
Eval stats on subsets of the data (sources)
Losses for early and late tokens
- top losses (samples)
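
The items above can be sketched as a few small helpers. This is an illustration in plain Python on flat lists; all function names and the input shapes (per-token loss lists, `(source, loss)` pairs, a dict of per-sample losses) are assumptions, and a real training loop would compute the same quantities on tensors and send them to its logger.

```python
import math
from collections import defaultdict

def tensor_stats(values):
    """Min, max and L2 norm of a flat list of activations, gradients or weights."""
    return {
        "min": min(values),
        "max": max(values),
        "l2": math.sqrt(sum(v * v for v in values)),
    }

def mean_loss_by_source(records):
    """Average eval loss per data source from (source, loss) pairs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for source, loss in records:
        totals[source] += loss
        counts[source] += 1
    return {source: totals[source] / counts[source] for source in totals}

def early_late_loss(token_losses, boundary):
    """Mean loss over token positions before and from `boundary` onward."""
    early, late = token_losses[:boundary], token_losses[boundary:]
    return sum(early) / len(early), sum(late) / len(late)

def top_loss_samples(sample_losses, k=2):
    """The k (sample_id, loss) pairs with the highest loss, worst first."""
    return sorted(sample_losses.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Illustrative inputs only.
stats = tensor_stats([3.0, -4.0, 0.0])
by_source = mean_loss_by_source([("web", 2.0), ("code", 1.0), ("web", 3.0)])
early, late = early_late_loss([4.0, 3.0, 2.0, 1.0], boundary=2)
worst = top_loss_samples({"a": 1.2, "b": 3.4, "c": 0.7}, k=2)
```

Logging these per layer and per step makes it easier to spot growing L2 norms before a run diverges, and the top-loss samples are a quick way to surface corrupted or unusual data.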

Issues

Reports

Meta LLAMA logs
BLOOM
LLM360
Allen Institute
