TLDR
- Labeling data sucks; paired image–text data from the web gives us something we can learn from in a “self”-supervised fashion
- Contrastive models are good for retrieval, but:
    - they require huge batches to train
    - they collapse to a bag-of-words model and don’t have a “semantic” understanding of images
- Autoregressive vision-language models learn richer representations of text and bring the other advantages that come with next-token prediction:
    - in-context learning
    - prompting
- The LLaVA-style architecture is the standard recipe: an MLP projector aligns pretrained vision-encoder outputs with pretrained LLM inputs (see the sketch under Modality Alignment below)
Contrastive Vision Language Models
Train separate backbones for images and text and align them with a contrastive objective, à la the standard two-tower approach.
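As a minimal sketch of that objective (names, dims, and the temperature below are placeholders, not taken from any particular codebase), the two towers are trained so matching pairs score higher than every other pairing in the batch:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: [batch, dim] outputs of the two towers.
    Matching pairs sit on the diagonal of the similarity matrix and
    every other item in the batch acts as a negative -- which is why
    this objective wants very large batches.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature        # [batch, batch]
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)            # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```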
CLIP
- https://github.com/mlfoundations/open_clip
- GitHub - baaivision/EVA: EVA Series: Vision Foundation Model Fanatics from BAAI
- https://github.com/OpenAI/CLIP
- https://github.com/LAION-AI/CLIP_benchmark
- https://github.com/rom1504/clip-retrieval
- https://github.com/rom1504/laion-prepro
- GitHub - yzhuoning/Awesome-CLIP: Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).
- e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce
- GitHub - facebookresearch/SLIP: Code release for SLIP Self-supervision meets Language-Image Pre-training
- GitHub - facebookresearch/flip: Official Open Source code for “Scaling Language-Image Pre-training via Masking”
- GitHub - facebookresearch/diht: Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
- GitHub - OliverRensu/DeepMIM
- [2311.17049] MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
- GitHub - bytedance/fc-clip: [NeurIPS 2023] This repo contains the code for our paper Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
- [2307.16634] CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification
- [2309.05551] OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data
- [2410.18857] Probabilistic Language-Image Pre-Training
- [2410.05270] Fine-Tuning CLIP’s Last Visual Projector: A Few-Shot Cornucopia
SigLIP
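SigLIP keeps the two-tower setup but replaces the batch-wise softmax with an independent sigmoid per image–text pair, so the loss needs no all-pairs normalization across a huge batch. A rough sketch, not the reference implementation; the initial values for the learnable scalars follow the paper’s reported settings:

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_emb, text_emb, t_log, b):
    """Pairwise sigmoid loss: every (i, j) pair is an independent binary
    classification -- positive on the diagonal, negative everywhere else.
    t_log (log-temperature) and b (bias) are learnable scalars."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() * t_log.exp() + b               # [batch, batch]
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)      # normalized by batch size

# Toy usage; the paper initializes t_log near log(10) and b near -10.
t_log = torch.tensor(2.303, requires_grad=True)
b = torch.tensor(-10.0, requires_grad=True)
loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512), t_log, b)
```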
Extensions
FLAIR
Autoregressive Vision Language Models
- [2306.07915v3] Image Captioners Are Scalable Vision Learners Too
- [2311.03079] CogVLM: Visual Expert for Pretrained Language Models
- GitHub - OpenBMB/MiniCPM-V: MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
- GitHub - FreedomIntelligence/LongLLaVA: LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
- [Stanford CS25: V4 I From Large Language Models to Large Multimodal Models - YouTube](https://www.youtube.com/watch?v=cYfKQ6YG9Qo)
- Recent Advances on Multimodal Models Pre-training - YouTube
- GitHub - OpenGVLab/InternVL: [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o’s performance.
CapPa
- [2306.07915v5] Image Captioners Are Scalable Vision Learners Too
- Weights & Biases - CapPa: Training vision models as captioners
LLaVA
LLaVA-OneVision
Flamingo
Qwen2 VL
Pixtral
InternVL
InternVL 2.5
MiniCPM
Molmo
Llama 3.2 Vision
Chameleon
AIMv2 (Multimodal Autoregressive Pre-training of Large Vision Encoders)
- GitHub - apple/ml-aim: This repository provides the code and model checkpoints for AIMv1 and AIMv2 research projects.
- [2401.08541] Scalable Pre-training of Large Autoregressive Image Models
Mixture-of-Transformers
NVLM
PaliGemma
CogVLM
Aria
Florence-VL
SmolVLM
Moondream
Other Approaches
Aligning Vision and LLMs
Multi Task / Combined Objectives
VladVA
Generative VLMs
- [2412.03069] TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
- [2412.04332] Liquid: Language Models are Scalable Multi-modal Generators
Patterns
Data Curation / Data Augmentation
Filter for Quality
Use the CLIP score to filter out web data whose captions don’t align with the image
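A rough sketch of that filter with open_clip; the checkpoint, threshold, and file paths below are assumptions (LAION-style pipelines used a cutoff around 0.3 with ViT-B/32):

```python
import torch
import open_clip
from PIL import Image

# Assumed checkpoint; any open_clip model works the same way.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_score(image_path, caption):
    """Cosine similarity between the image and caption embeddings."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.t()).item()

# Keep only pairs whose caption roughly matches the image.
THRESHOLD = 0.28  # assumed cutoff
pairs = [("cat.jpg", "a cat sleeping on a sofa"), ("dog.jpg", "BUY NOW!!! best deal")]
kept = [(path, cap) for path, cap in pairs if clip_score(path, cap) >= THRESHOLD]
```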
Expand Captions
Use existing VLMs to generate more detailed captions
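One hedged way to do that with an off-the-shelf LLaVA checkpoint from Hugging Face; the model id, prompt template, and generation settings are assumptions, and any instruction-tuned VLM could be swapped in:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed checkpoint; swap in whichever captioner/VLM you actually use.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

def expand_caption(image_path, short_caption):
    """Ask the VLM to rewrite a short alt-text caption into a detailed one."""
    prompt = (f"USER: <image>\nThe original caption is: '{short_caption}'. "
              "Describe the image in detail, keeping any facts from the caption. "
              "ASSISTANT:")
    image = Image.open(image_path)
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=256)
    text = processor.decode(out[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()
```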
Modality Alignment
MLP projection from Vision Backbone to LLM
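A minimal sketch of that alignment, with placeholder hidden sizes (e.g. CLIP ViT-L/14 patch features into a 4096-dim LLM); the MLP is the only newly initialized piece, and the projected image tokens are simply concatenated with the text embeddings before ordinary next-token-prediction training:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """LLaVA-1.5-style two-layer MLP mapping vision patch features
    into the LLM's token-embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):       # [batch, n_patches, vision_dim]
        return self.proj(patch_features)     # [batch, n_patches, llm_dim]

# Placeholder shapes: 576 patches = 24x24 grid from a 336px ViT-L/14.
batch, n_patches, n_text = 2, 576, 32
vision_feats = torch.randn(batch, n_patches, 1024)   # frozen vision backbone output
text_embeds = torch.randn(batch, n_text, 4096)       # LLM embedding-table lookups

projector = VisionProjector()
image_tokens = projector(vision_feats)
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)  # fed to the (pretrained) LLM
```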
Image Encoding
Multiple Resolutions, Scales and Aspect Ratios
End of Row Encoding
Positional Encodings
Token Compression / Dropping
Early vs Late Fusion
Datasets
- https://github.com/rom1504/laion-prepro
- GitHub - allenai/mmc4: MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
- nyu-visionx/Cambrian-10M · Datasets at Hugging Face
- allenai/pixmo-cap-qa · Datasets at Hugging Face
Leaderboard / Benchmarks
Links
- GitHub - microsoft/BridgeTower: Open source code for AAAI 2023 Paper “BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning”
- GitHub - microsoft/react
- https://github.com/salesforce/lavis
- https://github.com/uta-smile/TCL
- https://github.com/YehLi/xmodaler
- https://github.com/sangminwoo/awesome-vision-and-language
- Masked Vision and Language Modeling for Multi-modal Representation Learning (2022-08-03)
- GitHub - facebookresearch/CiT: Code for the paper titled “CiT Curation in Training for Effective Vision-Language Data”.
- GitHub - guilk/VLC: Research code for “Training Vision-Language Transformers from Captions Alone”
- GitHub - OliverRensu/TinyMIM
- GitHub - RERV/UniAdapter
- [2406.16860] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs