Pretraining - Large Scale Training Tricks

Data and Datasets

Balancing

Curriculum Learning

Annealing on High Quality Data


Stability

QK Norm

Softcapping


Optimization

Optimizers

Learning Rate Schedules

WSD - Warmup, Stable, Decay

Hyperparameter Tuning

Distributed Training

Monitoring and Logging

Things to Track

Issues

Reports

