[2111.10050] Combined Scaling for Zero-shot Transfer Learning
Introduces a new combined scaling method called “BASIC”
- scales up contrastive image-text models along three axes: model size, data size, and batch size
- achieves 85.7% zero-shot top-1 accuracy on ImageNet
- beats CLIP’s zero-shot accuracy by 9.3%
- trains on a noisy dataset of 6.6B image-text pairs, with the image encoder pretrained on the JFT dataset
- uses gradient accumulation and gradient checkpointing to scale the batch size to 2^20
- accumulates gradients for the contrastive similarity matrix across micro-batches (see the first sketch after this list)
- recomputes intermediate activations during the backward pass instead of storing them (gradient checkpointing; see the second sketch below)
- in particular, rematerializes activation functions and normalization layers, since they are cheap to recompute but their stored outputs consume a lot of memory
- uses the “AdaFactorW” optimizer, combining AdamW-style decoupled weight decay with Adafactor’s factorized second-moment statistics (see the last sketch below)
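
The notes above compress the paper’s memory tricks; here is a minimal PyTorch sketch, under stated assumptions, of one common way to accumulate contrastive gradients over micro-batches (a two-pass, GradCache-style scheme, not necessarily the paper’s exact implementation). Embeddings are first computed without autograd, the full-batch contrastive loss then yields gradients with respect to those embeddings, and each micro-batch is re-encoded with autograd to chain those gradients into the encoders. `img_encoder`, `txt_encoder`, and the chunk size are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over the full B x B similarity matrix."""
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def chunked_contrastive_step(img_encoder, txt_encoder, images, texts, chunk=1024):
    # Stage 1: embed every micro-batch without storing activations.
    with torch.no_grad():
        img_emb = torch.cat([F.normalize(img_encoder(images[i:i + chunk]), dim=-1)
                             for i in range(0, len(images), chunk)])
        txt_emb = torch.cat([F.normalize(txt_encoder(texts[i:i + chunk]), dim=-1)
                             for i in range(0, len(texts), chunk)])

    # Stage 2: full-batch loss on the cached embeddings only, to obtain
    # d(loss)/d(embedding) with no encoder activations in memory.
    img_emb.requires_grad_(True)
    txt_emb.requires_grad_(True)
    loss = clip_loss(img_emb, txt_emb)
    loss.backward()

    # Stage 3: re-encode each micro-batch with autograd and inject the cached
    # embedding gradients; encoder .grad buffers accumulate across chunks.
    for i in range(0, len(images), chunk):
        emb = F.normalize(img_encoder(images[i:i + chunk]), dim=-1)
        emb.backward(img_emb.grad[i:i + chunk])
    for i in range(0, len(texts), chunk):
        emb = F.normalize(txt_encoder(texts[i:i + chunk]), dim=-1)
        emb.backward(txt_emb.grad[i:i + chunk])
    return loss.detach()  # caller runs optimizer.step() / zero_grad() as usual
```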
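
Gradient checkpointing itself is a stock utility; a minimal sketch with `torch.utils.checkpoint`, where a hypothetical stack of LayerNorm/Linear/GELU blocks stands in for the paper’s much larger encoders:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical encoder: 24 blocks whose intermediates are rematerialized.
encoder = nn.Sequential(*[
    nn.Sequential(nn.LayerNorm(512), nn.Linear(512, 512), nn.GELU())
    for _ in range(24)
])

x = torch.randn(32, 512, requires_grad=True)
# Split the stack into 4 checkpointed segments: only segment-boundary
# activations are kept; the cheap-to-recompute LayerNorm and GELU outputs
# in between are rebuilt during the backward pass instead of being stored.
y = checkpoint_sequential(encoder, 4, x, use_reentrant=False)
y.sum().backward()
```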
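
“AdaFactorW” is not a stock library optimizer, so the sketch below is only a reading of the name: Adafactor’s factorized second-moment statistics (O(n + m) memory for an n x m weight) combined with AdamW-style decoupled weight decay, shown for a single 2-D parameter. It omits Adafactor’s update clipping and relative step sizes, and all hyperparameters are placeholders.

```python
import torch

@torch.no_grad()
def adafactorw_step(param, grad, row_v, col_v,
                    lr=1e-3, beta2=0.999, weight_decay=0.01, eps=1e-30):
    g2 = grad * grad + eps
    # Factorized second moment: running averages of per-row and per-column
    # means of the squared gradient, instead of a full n x m matrix.
    row_v.mul_(beta2).add_(g2.mean(dim=1), alpha=1 - beta2)
    col_v.mul_(beta2).add_(g2.mean(dim=0), alpha=1 - beta2)
    # Rank-1 reconstruction of the second moment: outer(row, col) / mean(row).
    v_hat = torch.outer(row_v, col_v) / row_v.mean()
    # Decoupled weight decay (the "W" part): shrink weights directly
    # rather than folding the decay into the gradient.
    param.mul_(1 - lr * weight_decay)
    param.add_(grad / v_hat.sqrt(), alpha=-lr)

# Usage (hypothetical): one row/col state pair per 2-D weight matrix.
W = torch.randn(512, 1024)
row_v, col_v = torch.zeros(512), torch.zeros(1024)
adafactorw_step(W, torch.randn_like(W), row_v, col_v)
```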