- pretraining
  - data filtered for quality
  - include instruction tuning data in the pretraining mix
  - synthetic data
  - weighted sampling from different sources / categories (sketch after this list)
- long context training
- annealing with high quality data
- supervised finetuning
- RLHF / DPO (DPO loss sketch below)
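
A minimal sketch of weighted sampling over data sources for the pretraining mix; the source names and weights are illustrative assumptions, not a recommended mixture.

```python
import random

# Illustrative mixture weights per data source (assumed, not a real recipe).
SOURCE_WEIGHTS = {
    "web": 0.6,
    "code": 0.2,
    "papers": 0.1,
    "synthetic": 0.1,
}

def sample_source(rng: random.Random) -> str:
    # Pick which source the next training document is drawn from,
    # proportionally to its mixture weight.
    names = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```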
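
And a minimal sketch of the DPO loss, assuming per-sequence log-probabilities of the chosen and rejected responses have already been computed under the trained policy and a frozen reference model (the tensor names are mine).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # Implicit rewards: log-prob ratio of the policy vs. the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push up the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```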
Optimizations
Quantized Optimizers
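
For example, bitsandbytes ships 8-bit Adam variants that keep optimizer state in block-wise int8; a rough sketch, with a single linear layer standing in for a real model:

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for a transformer
# AdamW8bit stores its first/second moments in 8-bit blocks,
# cutting optimizer-state memory roughly 4x vs. fp32 state.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```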
Fused Ops
- unslothai/unsloth: finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
- linkedin/Liger-Kernel: efficient Triton kernels for LLM training
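
A sketch of plugging Liger-Kernel's fused Triton kernels (RMSNorm, RoPE, SwiGLU, fused cross-entropy) into a Hugging Face Llama model; the patch function name follows the Liger-Kernel README as I recall it, and the checkpoint name is just an example.

```python
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Monkey-patch the Llama modules with fused Triton kernels
# before the model is instantiated.
apply_liger_kernel_to_llama()

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
```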
Compile (torch.compile)
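
torch.compile traces the eager-mode forward pass and generates fused kernels; a minimal sketch:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()

# The first call triggers compilation; later calls reuse the compiled graph.
compiled_model = torch.compile(model)
out = compiled_model(torch.randn(8, 1024, device="cuda"))
```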
FlexAttention
Block-causal (document) mask so several samples can be packed into one sequence without attending across document boundaries
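
A sketch of a document-level block-causal mask with FlexAttention: packed samples share one sequence, each token attends causally only within its own document. The document lengths and tensor shapes are made up.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Three documents packed into one 256-token sequence (lengths are made up).
doc_lengths = torch.tensor([100, 60, 96], device="cuda")
doc_ids = torch.repeat_interleave(
    torch.arange(len(doc_lengths), device="cuda"), doc_lengths
)

def doc_causal(b, h, q_idx, kv_idx):
    # Causal within a document, no attention across document boundaries.
    return (doc_ids[q_idx] == doc_ids[kv_idx]) & (q_idx >= kv_idx)

S = doc_ids.numel()
block_mask = create_block_mask(doc_causal, B=None, H=None, Q_LEN=S, KV_LEN=S)

q = k = v = torch.randn(1, 8, S, 64, device="cuda")  # (batch, heads, seq, dim)
out = flex_attention(q, k, v, block_mask=block_mask)
```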