CLIP in GPT
- Use both CLIP encoders (text and image) in front of a smaller “latent” decoder
|--------tokens---------|
|------------------------small decoder------------|
|BERT tokens||ViT tokens|
- share the same text tokenizer between the CLIP text encoder and the decoder
- input the CLIP encoder tokens to GPT directly as (prefix) inputs, or attend to them with cross attention (architecture sketch after the list)
- keep the classification / embedding token; apply a different loss to it (or none at all) instead of the LM loss (loss-masking sketch below)
- cache the BERT and ViT encodings so repeated inputs are not re-encoded (caching sketch below)
- potentially prune or merge encoder tokens to shorten the decoder's context (merging sketch below)
- can do a first-pass search with the embedding tokens, then feed the retrieved items into the model as RAG / context (retrieval sketch below)
- rerank with a ColBERT / ColPali-style late-interaction scorer over the per-token encodings (reranking sketch below)
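A minimal sketch of the architecture above, assuming PyTorch; `bert_tokens` / `vit_tokens` stand in for cached CLIP text / image encoder outputs, and the bridge projection, dimensions, layer counts, and vocab size are illustrative, not a prescribed design. It shows the cross-attention variant; the prefix variant would instead concatenate the projected encoder tokens in front of the text embeddings.

```python
import torch
import torch.nn as nn

class LatentDecoderOverCLIP(nn.Module):
    """Small decoder that cross-attends to frozen BERT / ViT encoder tokens."""
    def __init__(self, enc_dim=768, dec_dim=512, vocab_size=32000, n_layers=4, n_heads=8):
        super().__init__()
        self.bridge = nn.Linear(enc_dim, dec_dim)         # map encoder tokens into decoder space
        self.tok_emb = nn.Embedding(vocab_size, dec_dim)
        layer = nn.TransformerDecoderLayer(dec_dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(dec_dim, vocab_size)

    def forward(self, text_ids, bert_tokens, vit_tokens):
        # concatenated encoder tokens act as the cross-attention "memory";
        # the prefix alternative would concatenate them onto the decoder input instead
        memory = self.bridge(torch.cat([bert_tokens, vit_tokens], dim=1))
        x = self.tok_emb(text_ids)
        T = x.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(h)

# usage with random stand-ins for cached encodings
dec = LatentDecoderOverCLIP()
logits = dec(
    torch.randint(0, 32000, (2, 16)),   # decoder text token ids
    torch.randn(2, 32, 768),            # cached BERT tokens
    torch.randn(2, 49, 768),            # cached ViT patch tokens
)                                        # -> (2, 16, 32000)
```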
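For the kept classification / embedding token, one hedged way to exclude it from the language-modelling loss; position 0 carrying the embedding token is an assumption, and any contrastive or embedding loss on it would be computed separately.

```python
import torch
import torch.nn.functional as F

def lm_loss_without_embedding_token(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (B, T, vocab), targets: (B, T); skip position 0, which holds the embedding token
    return F.cross_entropy(logits[:, 1:].flatten(0, 1), targets[:, 1:].flatten())
```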
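Since the encoders stay frozen in this setup, their outputs are deterministic and can be memoised; a sketch keyed by a hash of the raw input (the `preprocess` callable and the in-memory dict are assumptions, a real system might cache to disk or a vector store).

```python
import hashlib
import torch

class EncoderCache:
    """Memoise frozen BERT / ViT encodings, keyed by a hash of the raw input."""
    def __init__(self, encoder):
        self.encoder = encoder
        self._store = {}

    def __call__(self, raw: bytes, preprocess) -> torch.Tensor:
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._store:
            with torch.no_grad():                       # encoder is frozen, no grads needed
                self._store[key] = self.encoder(preprocess(raw)).cpu()
        return self._store[key]
```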
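For pruning or merging encoder tokens, a much-simplified sketch in the spirit of token-merging methods (not any exact published algorithm): repeatedly average the most similar pair of neighbouring tokens until the sequence is short enough.

```python
import torch

def merge_neighbour_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    # tokens: (seq_len, dim) encoder outputs; returns (keep, dim)
    t = tokens
    while t.size(0) > keep:
        sim = torch.cosine_similarity(t[:-1], t[1:], dim=-1)   # similarity of adjacent tokens
        i = int(sim.argmax())
        merged = (t[i] + t[i + 1]) / 2                          # average the closest pair
        t = torch.cat([t[:i], merged.unsqueeze(0), t[i + 2:]], dim=0)
    return t
```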
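The first-pass search over the pooled embedding tokens can be a plain cosine-similarity top-k; `doc_embs` (corpus embedding tokens) and `query_emb` are assumed to be precomputed with the same encoders.

```python
import torch
import torch.nn.functional as F

def first_pass_search(query_emb: torch.Tensor, doc_embs: torch.Tensor, k: int = 20):
    # query_emb: (dim,), doc_embs: (num_docs, dim) pooled embedding tokens
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    scores = d @ q                        # cosine similarity against every document
    top = scores.topk(k)                  # candidates to pull into the RAG context
    return top.values, top.indices
```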
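The retrieved candidates can then be reranked ColBERT / ColPali-style with late interaction: each query token is matched to its best document token and the max similarities are summed (MaxSim). The per-token encodings here are exactly the cached BERT / ViT tokens from above.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    # query_tokens: (Lq, dim), doc_tokens: (Ld, dim) per-token encodings
    q = F.normalize(query_tokens, dim=-1)
    d = F.normalize(doc_tokens, dim=-1)
    sim = q @ d.T                               # (Lq, Ld) token-to-token similarities
    return sim.max(dim=-1).values.sum()         # best doc token per query token, summed

# e.g. reranked = sorted(candidates, key=lambda c: maxsim_score(q_toks, c), reverse=True)
```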