CLIP in GPT
- Use both CLIP encoders (text and image) in front of a smaller “latent” decoder
|--------tokens---------|
|------------------------small decoder------------|
|BERT tokens||ViT tokens|
- share the same text tokenizer between the CLIP text encoder and the decoder
- input the CLIP encoder tokens to GPT directly as (prefix) inputs, or attend to them with cross attention (architecture sketch after the list)
- keep the classification / embedding token; apply a different loss to it (or none at all) instead of the LM loss (loss-masking sketch below)
- cache the BERT and ViT encodings so repeated inputs are not re-encoded (caching sketch below)
- potentially prune or merge encoder tokens to shorten the decoder's context (merging sketch below)
- can do a first-pass search with the embedding tokens, then feed the retrieved items into the model as RAG / context (retrieval sketch below)
- rerank with a ColBERT / ColPali-style late-interaction scorer over the per-token encodings (reranking sketch below)
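A minimal sketch of the architecture above, assuming PyTorch; `bert_tokens` / `vit_tokens` stand in for cached CLIP text / image encoder outputs, and the bridge projection, dimensions, layer counts, and vocab size are illustrative, not a prescribed design. It shows the cross-attention variant; the prefix variant would instead concatenate the projected encoder tokens in front of the text embeddings.

```python
import torch
import torch.nn as nn

class LatentDecoderOverCLIP(nn.Module):
    """Small decoder that cross-attends to frozen BERT / ViT encoder tokens."""
    def __init__(self, enc_dim=768, dec_dim=512, vocab_size=32000, n_layers=4, n_heads=8):
        super().__init__()
        self.bridge = nn.Linear(enc_dim, dec_dim)         # map encoder tokens into decoder space
        self.tok_emb = nn.Embedding(vocab_size, dec_dim)
        layer = nn.TransformerDecoderLayer(dec_dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(dec_dim, vocab_size)

    def forward(self, text_ids, bert_tokens, vit_tokens):
        # concatenated encoder tokens act as the cross-attention "memory";
        # the prefix alternative would concatenate them onto the decoder input instead
        memory = self.bridge(torch.cat([bert_tokens, vit_tokens], dim=1))
        x = self.tok_emb(text_ids)
        T = x.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(h)

# usage with random stand-ins for cached encodings
dec = LatentDecoderOverCLIP()
logits = dec(
    torch.randint(0, 32000, (2, 16)),   # decoder text token ids
    torch.randn(2, 32, 768),            # cached BERT tokens
    torch.randn(2, 49, 768),            # cached ViT patch tokens
)                                        # -> (2, 16, 32000)
```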
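For the kept classification / embedding token, one hedged way to exclude it from the language-modelling loss; position 0 carrying the embedding token is an assumption, and any contrastive or embedding loss on it would be computed separately.

```python
import torch
import torch.nn.functional as F

def lm_loss_without_embedding_token(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (B, T, vocab), targets: (B, T); skip position 0, which holds the embedding token
    return F.cross_entropy(logits[:, 1:].flatten(0, 1), targets[:, 1:].flatten())
```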
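Since the encoders stay frozen in this setup, their outputs are deterministic and can be memoised; a sketch keyed by a hash of the raw input (the `preprocess` callable and the in-memory dict are assumptions, a real system might cache to disk or a vector store).

```python
import hashlib
import torch

class EncoderCache:
    """Memoise frozen BERT / ViT encodings, keyed by a hash of the raw input."""
    def __init__(self, encoder):
        self.encoder = encoder
        self._store = {}

    def __call__(self, raw: bytes, preprocess) -> torch.Tensor:
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._store:
            with torch.no_grad():                       # encoder is frozen, no grads needed
                self._store[key] = self.encoder(preprocess(raw)).cpu()
        return self._store[key]
```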
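For pruning or merging encoder tokens, a much-simplified sketch in the spirit of token-merging methods (not any exact published algorithm): repeatedly average the most similar pair of neighbouring tokens until the sequence is short enough.

```python
import torch

def merge_neighbour_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    # tokens: (seq_len, dim) encoder outputs; returns (keep, dim)
    t = tokens
    while t.size(0) > keep:
        sim = torch.cosine_similarity(t[:-1], t[1:], dim=-1)   # similarity of adjacent tokens
        i = int(sim.argmax())
        merged = (t[i] + t[i + 1]) / 2                          # average the closest pair
        t = torch.cat([t[:i], merged.unsqueeze(0), t[i + 2:]], dim=0)
    return t
```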
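The first-pass search over the pooled embedding tokens can be a plain cosine-similarity top-k; `doc_embs` (corpus embedding tokens) and `query_emb` are assumed to be precomputed with the same encoders.

```python
import torch
import torch.nn.functional as F

def first_pass_search(query_emb: torch.Tensor, doc_embs: torch.Tensor, k: int = 20):
    # query_emb: (dim,), doc_embs: (num_docs, dim) pooled embedding tokens
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    scores = d @ q                        # cosine similarity against every document
    top = scores.topk(k)                  # candidates to pull into the RAG context
    return top.values, top.indices
```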
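The retrieved candidates can then be reranked ColBERT / ColPali-style with late interaction: each query token is matched to its best document token and the max similarities are summed (MaxSim). The per-token encodings here are exactly the cached BERT / ViT tokens from above.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    # query_tokens: (Lq, dim), doc_tokens: (Ld, dim) per-token encodings
    q = F.normalize(query_tokens, dim=-1)
    d = F.normalize(doc_tokens, dim=-1)
    sim = q @ d.T                               # (Lq, Ld) token-to-token similarities
    return sim.max(dim=-1).values.sum()         # best doc token per query token, summed

# e.g. reranked = sorted(candidates, key=lambda c: maxsim_score(q_toks, c), reverse=True)
```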