Data Sampling and Curation
Augmentation
Batch
Mixup
Augmentation Schedules
FixRes
Initialization
muP
Maximal Update Parametrization (μP)
Weight Averaging
SWA
Model Soup
Learning Rates
Learning Rate Range Test
Learning Rate Schedules
Warmup-Stable-Decay (WSD)
- [2408.13359v1] Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler
- Allows continued pretraining from checkpoint before decay
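A minimal sketch of why a WSD schedule permits resuming from a pre-decay checkpoint: the learning rate in the warmup and stable phases does not depend on the total training length, so the stable phase can be extended later without changing anything that came before. The function name `wsd_lr`, its parameters, and the linear shapes chosen for warmup and decay are assumptions for illustration, not taken from the cited paper.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at `step` under a warmup-stable-decay schedule.

    Hypothetical helper: linear warmup and linear decay are assumed here;
    other decay shapes are possible.
    """
    assert decay_steps > 0, "decay_steps must be positive"
    if step < warmup_steps:
        # Warmup: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = warmup_steps + stable_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr down to min_lr, then hold min_lr.
    t = min(step - decay_start, decay_steps) / decay_steps
    return peak_lr + (min_lr - peak_lr) * t


# Original run: 10 warmup + 100 stable + 20 decay steps.
short_run = [wsd_lr(s, 1.0, 10, 100, 20) for s in range(130)]
# Extended run resumed from the step-110 checkpoint (before decay began),
# now with a 500-step stable phase.
long_run = [wsd_lr(s, 1.0, 10, 500, 20) for s in range(530)]

# Both runs share the same schedule up to the pre-decay checkpoint.
print(short_run[:110] == long_run[:110])
```

Because only the decay phase depends on where training ends, the checkpoint saved at the end of the stable phase can serve as the starting point for a longer run, with a fresh decay applied at the new end.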