Multimodal VLMs

TLDR

  1. Labeling data sucks; we have abundant paired image-text data on the web that we can learn from in a “self”-supervised fashion
  2. Contrastive models are good for retrieval but:
    1. require huge batches to train (negatives come from within the batch)
    2. tend to collapse to a bag-of-words model and lack fine-grained “semantic” understanding of images
  3. Autoregressive vision-language models learn richer representations and bring the advantages that come with next-token prediction:
    1. in-context learning
    2. prompting
  4. LLaVA-style architectures are the standard (an MLP projector aligns pretrained vision-encoder outputs with the pretrained LLM’s input embedding space)

Contrastive Vision Language Models

Train separate backbones for image and text, then align them with a contrastive objective à la the standard two-tower approach.
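A minimal sketch of the symmetric InfoNCE objective behind CLIP-style two-tower training (names and the temperature value are illustrative, not the original implementation). Each image embedding is pulled toward its own caption and pushed away from every other caption in the batch, which is why large batches, i.e. more in-batch negatives, matter so much.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)                # (B, D) from the image tower
    text_emb = F.normalize(text_emb, dim=-1)                  # (B, D) from the text tower
    logits = image_emb @ text_emb.t() / temperature           # (B, B) cosine-similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)               # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)           # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

SigLIP replaces this batch-wide softmax cross-entropy with a pairwise sigmoid loss, which relaxes the dependence on very large batch sizes.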

CLIP

SigLIP

Extensions

FLAIR

Autoregressive Vision Language Models

CapPa

LLaVA

LLaVA-OneVision

Flamingo

Qwen2-VL

Pixtral

InternVL

InternVL 2.5

MiniCPM

Molmo

Llama 3.2 Vision

Chameleon

AIMv2 (Multimodal Autoregressive Pre-training of Large Vision Encoders)

Mixture-of-Transformers

NVLM

PaliGemma

CogVLM

Aria

Florence-VL

SmolVLM

Moondream

Other Approaches

Aligning Vision and LLMs

Multi Task / Combined Objectives

VladVA

Generative VLMs

Patterns

Data Curation / Data Augmentation

Filter for Quality

Use CLIP score to filter out web image-text pairs whose captions don’t align with the image.
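A hedged sketch of that filter using the Hugging Face CLIPModel; the checkpoint name and the 0.25 cutoff are assumptions — real pipelines tune the threshold per dataset.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(images, captions):
    # Embed images and captions with the same CLIP model, return per-pair cosine similarity
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb * text_emb).sum(dim=-1)

def filter_pairs(images, captions, threshold=0.25):
    # Keep only pairs whose caption actually describes the image (score above the cutoff)
    scores = clip_scores(images, captions)
    return [(img, cap) for img, cap, s in zip(images, captions, scores) if s >= threshold]
```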

Expand Captions

Use existing VLMs to generate more detailed captions
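A sketch of recaptioning with an off-the-shelf open VLM via transformers; the llava-hf/llava-1.5-7b-hf checkpoint and the prompt format are assumptions — any capable VLM works here.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # illustrative choice of captioning VLM
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

def expand_caption(image_path: str) -> str:
    image = Image.open(image_path)
    # LLaVA-1.5 chat format; the <image> placeholder is replaced by projected image tokens
    prompt = "USER: <image>\nDescribe this image in as much detail as possible.\nASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    text = processor.decode(output_ids[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()   # keep only the generated caption
```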

Modality Alignment

MLP projection from the vision backbone’s outputs into the LLM’s input embedding space.
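A minimal sketch of that alignment module in the LLaVA style, assuming a ViT-like vision backbone and a Hugging Face-style decoder LLM; dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # pretrained ViT (e.g. CLIP/SigLIP image tower)
        self.llm = llm                         # pretrained decoder-only LLM (HF-style interface assumed)
        # Two-layer MLP projector maps patch embeddings into the LLM's token-embedding space
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, input_ids):
        patch_embs = self.vision_encoder(pixel_values)            # (B, num_patches, vision_dim)
        image_tokens = self.projector(patch_embs)                 # (B, num_patches, llm_dim)
        text_embs = self.llm.get_input_embeddings()(input_ids)    # (B, seq_len, llm_dim)
        # Prepend image tokens; the LLM then does ordinary next-token prediction over the sequence
        inputs_embeds = torch.cat([image_tokens, text_embs], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

In LLaVA’s two-stage recipe only the projector is trained in the first (alignment) stage, with the vision encoder and LLM frozen; instruction tuning then also updates the LLM.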

Image Encoding

Multiple Resolutions, Scales and Aspect Ratios

End of Row Encoding

Positional Encodings

Token Compression / Dropping

Early vs Late Fusion

Datasets

Leaderboard / Benchmarks

Links