CLIP vision-language
- CLIP ViT-Huge (1B parameters)
- SwitchBack
- int8 linear layer
- 13-25% speedup
- ~90% of transformer compute spent in linear layers
- quantization
- quantization noise grows with the inner dimension of the matrix multiply (toy demo below)
- this matters for CLIP because contrastive training requires very large batch sizes, which make the inner dimension of the weight-gradient matmul (batch x sequence length) huge
- use 16-bit precision for the weight-gradient matmul
- int8 for the forward pass and the input-gradient matmul (sketch below)
- a Triton kernel implementing SwitchBack is provided
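
Why the inner dimension matters: a toy demo (not from the paper) that simulates per-tensor int8 quantization of both matmul operands and measures how the output error grows as the inner dimension K increases. The sizes and the symmetric max-abs quantizer are arbitrary choices for illustration.

```python
import torch

def fake_int8(x):
    # symmetric per-tensor int8 quantization, returned dequantized ("fake quant")
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    return (x / scale).round().clamp(-127, 127) * scale

torch.manual_seed(0)
for k in [256, 4096, 65536]:          # inner dimension of the matmul
    a = torch.randn(64, k)
    b = torch.randn(k, 64)
    exact = a @ b
    approx = fake_int8(a) @ fake_int8(b)
    err = (exact - approx).abs().mean().item()
    print(f"inner dim {k:>6}: mean abs error {err:.3f}")   # grows with k
```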
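A minimal PyTorch sketch of the precision split SwitchBack uses, assuming simulated ("fake") int8 quantization with per-tensor scaling and a flattened 2-D input; the real implementation is a fused Triton kernel with its own scaling scheme, so this only illustrates which matmul runs in which precision.

```python
import torch

def fake_int8(x):
    # symmetric per-tensor int8 quantization, returned dequantized ("fake quant")
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    return (x / scale).round().clamp(-127, 127) * scale

class SwitchBackLinearFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w):
        # x: (batch*seq, in_features), w: (out_features, in_features)
        ctx.save_for_backward(x, w)
        # forward matmul on int8-quantized operands (simulated)
        return fake_int8(x) @ fake_int8(w).t()

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        # input gradient: also int8; its inner dim is out_features (small)
        grad_x = fake_int8(grad_out) @ fake_int8(w)
        # weight gradient: 16-bit, because its inner dim is batch*seq (huge)
        grad_w = (grad_out.to(torch.bfloat16).t() @ x.to(torch.bfloat16)).to(w.dtype)
        return grad_x, grad_w

class SwitchBackLinear(torch.nn.Linear):
    def forward(self, x):
        out = SwitchBackLinearFn.apply(x, self.weight)
        return out if self.bias is None else out + self.bias
```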
- reduce large-magnitude features
- initialize LayerScale to 0 (sketch below)
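
A minimal sketch of LayerScale with zero init; the class name and the placement on the residual branch are illustrative assumptions, not the paper's code.

```python
import torch

class LayerScale(torch.nn.Module):
    """Per-channel scale on a residual branch, initialized to 0 so the
    branch starts as a no-op and large-magnitude features build up slowly."""
    def __init__(self, dim, init_value=0.0):
        super().__init__()
        self.gamma = torch.nn.Parameter(torch.full((dim,), init_value))

    def forward(self, x):
        return x * self.gamma

# hypothetical placement inside a transformer block:
#   x = x + layer_scale_attn(attn(norm1(x)))
#   x = x + layer_scale_mlp(mlp(norm2(x)))
```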
- loss spikes occur 1-8 iterations after the squared gradients become under-estimated by their AdamW second moment estimator.
- use AdamW-Adafactor; it works better than gradient clipping
- StableAdamW == AdamW-Adafactor
- AdamW plus Adafactor-style update clipping
- tracks the average ratio of the squared gradient to its second-moment estimator and lowers the learning rate when that ratio is large (sketch below)
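
A sketch of the StableAdamW step as described above: a plain AdamW update whose learning rate is divided by max(1, sqrt(mean(g^2 / v_hat))), computed per parameter tensor. The hyperparameters, the per-tensor clipping granularity, and whether weight decay also uses the clipped rate are assumptions for illustration.

```python
import torch

@torch.no_grad()
def stable_adamw_step(p, m, v, step, lr=1e-3, betas=(0.9, 0.999),
                      eps=1e-8, weight_decay=0.01):
    """One StableAdamW update for a single parameter tensor p (step starts at 1)."""
    g = p.grad
    m.mul_(betas[0]).add_(g, alpha=1 - betas[0])          # first-moment EMA
    v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])   # second-moment EMA
    m_hat = m / (1 - betas[0] ** step)                    # bias correction
    v_hat = v / (1 - betas[1] ** step)

    # Adafactor-style clipping: average ratio of g^2 to the second-moment
    # estimate; when it exceeds 1, lower the effective learning rate.
    rms = (g.pow(2) / v_hat.clamp(min=eps)).mean().sqrt()
    lr_t = lr / max(1.0, rms.item())

    p.mul_(1 - lr_t * weight_decay)                       # decoupled weight decay
    p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr_t)
```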