- Dataset Augmentation
- CoCa-generated captions (multiple per image), in addition to the source captions
- Text and Image Embeddings from larger CLIP Models
- embed multiple augmented versions of the images and synthetic captions
- use multiple CLIP models in ensemble
- store augmentation params and use them at train time to reproduce the augmented version of each image (see the sketch below)
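A minimal sketch of the store-and-replay idea, assuming torchvision-style transforms; the parameter-dict layout and function names are mine, not from the paper:

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

def sample_aug_params(img, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)):
    # Draw random crop coordinates and a flip flag once, offline.
    i, j, h, w = transforms.RandomResizedCrop.get_params(img, scale=scale, ratio=ratio)
    return {"crop": (i, j, h, w), "flip": random.random() < 0.5}

def apply_aug(img, params, size=224):
    # Replay the stored parameters to reproduce the exact augmented view
    # whose teacher embedding was precomputed.
    i, j, h, w = params["crop"]
    out = TF.resized_crop(img, i, j, h, w, [size, size])
    return TF.hflip(out) if params["flip"] else out
```

At train time, the stored teacher embedding for a given view can then be matched against the student's embedding of the identically augmented image.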
- Loss
- CLIP loss + distillation term
- computed on both real and synthetic data, then summed for the final loss (see the sketch below)
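A sketch of how the combined objective might look in PyTorch; the mixing weight `lam`, temperature `tau`, and the batch-dict keys are illustrative placeholders, not values from the paper:

```python
import torch
import torch.nn.functional as F

def clip_loss(img, txt, logit_scale):
    # Standard symmetric InfoNCE; img/txt are L2-normalized [B, D] embeddings.
    logits = logit_scale * img @ txt.t()
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def distill_loss(s_img, s_txt, t_img, t_txt, tau=1.0):
    # KL divergence between the teacher's and the student's
    # image-to-text similarity distributions over the batch.
    s = F.log_softmax(s_img @ s_txt.t() / tau, dim=-1)
    t = F.softmax(t_img @ t_txt.t() / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")

def total_loss(batch, lam=0.5):
    # Evaluate the same objective with real and synthetic captions and sum.
    loss = 0.0
    for txt, t_txt in [(batch["real_txt"], batch["t_real_txt"]),
                       (batch["synth_txt"], batch["t_synth_txt"])]:
        loss += (1 - lam) * clip_loss(batch["img"], txt, batch["logit_scale"])
        loss += lam * distill_loss(batch["img"], txt, batch["t_img"], t_txt)
    return loss
```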
- Models
- Text: Text-RepMixer
- Vision: a FastViT variant called MCi
- reduce the MLP expansion ratio from 4 to 3, because of a “significant amount of redundancy in linear layers”, and make the model deeper instead (see the sketch after this list)
- MCi2 matches FastViT on ImageNet (84.5%) while being 15% faster and 14.3% smaller
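To illustrate the expansion-ratio change, a toy FastViT-style pointwise feed-forward block; the class and argument names are mine:

```python
import torch.nn as nn

class ConvFFN(nn.Module):
    """Pointwise FFN block (sketch). Shrinking `expansion` from 4 to 3
    narrows the linear layers; the saved parameters are spent on extra
    blocks (depth) instead."""
    def __init__(self, dim: int, expansion: int = 3):
        super().__init__()
        hidden = dim * expansion
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1),
        )

    def forward(self, x):
        return x + self.ffn(x)  # residual connection
```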
- Training
- 12M: 8 A100s, batch size 8,192
- 1B: 256 A100s, batch size 65,536
- Dataset Reinforcement
- 5 synthetic captions per image using the coca_ViT-L-14 model in OpenCLIP
- concatenate two CLIP image embeddings (DataComp and OpenAI ViT-L-14)
- store in bfloat16
- use gzipped pickle (see the sketch below)
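A rough end-to-end sketch of the reinforcement step. The OpenCLIP model/pretrained tags, sampling settings, and record layout are my assumptions, not confirmed by the notes:

```python
import gzip
import pickle
import torch
import open_clip

# CoCa captioner plus two CLIP image-embedding teachers (tags are assumptions).
coca, _, coca_tf = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="mscoco_finetuned_laion2b_s13b_b90k")
clip_a, _, clip_tf = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
clip_b, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="datacomp_xl_s13b_b90k")

@torch.no_grad()
def reinforce(pil_img, n_captions=5):
    im = coca_tf(pil_img).unsqueeze(0)
    # Sample diverse captions via top-p sampling (settings illustrative).
    caps = []
    for _ in range(n_captions):
        tokens = coca.generate(im, generation_type="top_p", top_p=0.9)
        caps.append(open_clip.decode(tokens[0])
                    .split("<end_of_text>")[0]
                    .replace("<start_of_text>", "").strip())
    # Concatenate the two teachers' image embeddings; store compactly.
    x = clip_tf(pil_img).unsqueeze(0)
    emb = torch.cat([clip_a.encode_image(x), clip_b.encode_image(x)], dim=-1)
    return {"captions": caps, "image_emb": emb.to(torch.bfloat16)}

record = reinforce(my_pil_image)  # my_pil_image: a PIL.Image, not defined here
with gzip.open("sample_0000.pkl.gz", "wb") as f:
    pickle.dump(record, f)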
- Strong Augmentation
- used for the 12M set
- Inference
- benchmarked on an iPhone 12 via Core ML (see the export sketch below)
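For on-device latency measurements, a typical Core ML export path via coremltools; `ImageEncoder` and the input resolution are placeholders, not the paper's actual setup:

```python
import torch
import coremltools as ct

model = ImageEncoder().eval()          # placeholder nn.Module, not defined here
example = torch.rand(1, 3, 256, 256)   # input resolution is an assumption
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="image", shape=example.shape)],
    compute_units=ct.ComputeUnit.ALL,  # CPU + GPU + Neural Engine
)
mlmodel.save("image_encoder.mlpackage")
```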
- Ideas
- Captioning model that takes both the image and the source caption as input when generating new captions
- potentially also use nearest neighbors (see the sketch below)
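One way to realize the nearest-neighbor idea: retrieve captions of visually similar images from an embedding bank and feed them to the captioner as extra context. Purely a sketch with illustrative names:

```python
import numpy as np

def nn_captions(query_emb, bank_embs, bank_caps, k=5):
    """query_emb: [D], bank_embs: [N, D]; both assumed L2-normalized,
    so the dot product equals cosine similarity."""
    sims = bank_embs @ query_emb
    top = np.argsort(-sims)[:k]
    return [bank_caps[i] for i in top]
```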