Multimodal VLMs

TLDR

  1. Labeling data sucks; we have abundant paired image-text data on the web that we can learn from in a “self”-supervised fashion
  2. Contrastive models are good for retrieval but:
    1. require huge batches to train (negatives come from within the batch)
    2. tend to collapse to a bag-of-words model and lack fine-grained “semantic” understanding of images
  3. Autoregressive vision-language models learn richer representations and bring the advantages that come with next-token prediction:
    1. in-context learning
    2. prompting
  4. LLaVA-style architectures are the standard (an MLP projector aligns pretrained vision-encoder outputs with the pretrained LLM’s input embedding space)

Contrastive Vision Language Models

Train separate backbones for image and text, then align them with a contrastive objective à la the standard two-tower approach.
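A minimal sketch of the symmetric InfoNCE objective behind CLIP-style two-tower training (names and the temperature value are illustrative, not the original implementation). Each image embedding is pulled toward its own caption and pushed away from every other caption in the batch, which is why large batches, i.e. more in-batch negatives, matter so much.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched (image, text) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)                # (B, D) from the image tower
    text_emb = F.normalize(text_emb, dim=-1)                  # (B, D) from the text tower
    logits = image_emb @ text_emb.t() / temperature           # (B, B) cosine-similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)               # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)           # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

SigLIP replaces this batch-wide softmax cross-entropy with a pairwise sigmoid loss, which relaxes the dependence on very large batch sizes.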

CLIP

SigLIP

Extensions

FLAIR

Autoregressive Vision Language Models

CapPa

LLaVA

LLaVA-OneVision

Flamingo

Qwen2-VL

Pixtral

InternVL

InternVL 2.5

MiniCPM

Molmo

Llama 3.2 Vision

Chameleon

AIMv2 (Multimodal Autoregressive Pre-training of Large Vision Encoders)

Mixture-of-Transformers

NVLM

PaliGemma

CogVLM

Aria

Florence-VL

SmolVLM

Moondream

Other Approaches

Aligning Vision and LLMs

Multi Task / Combined Objectives

VladVA

Generative VLMs

Patterns

Data Curation / Data Augmentation

Filter for Quality

Use CLIP score to filter out web image-text pairs whose captions don’t align with the image.
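A hedged sketch of that filter using the Hugging Face CLIPModel; the checkpoint name and the 0.25 cutoff are assumptions — real pipelines tune the threshold per dataset.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(images, captions):
    # Embed images and captions with the same CLIP model, return per-pair cosine similarity
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb * text_emb).sum(dim=-1)

def filter_pairs(images, captions, threshold=0.25):
    # Keep only pairs whose caption actually describes the image (score above the cutoff)
    scores = clip_scores(images, captions)
    return [(img, cap) for img, cap, s in zip(images, captions, scores) if s >= threshold]
```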

Expand Captions

Use existing VLMs to generate more detailed captions
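A sketch of recaptioning with an off-the-shelf open VLM via transformers; the llava-hf/llava-1.5-7b-hf checkpoint and the prompt format are assumptions — any capable VLM works here.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # illustrative choice of captioning VLM
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

def expand_caption(image_path: str) -> str:
    image = Image.open(image_path)
    # LLaVA-1.5 chat format; the <image> placeholder is replaced by projected image tokens
    prompt = "USER: <image>\nDescribe this image in as much detail as possible.\nASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    text = processor.decode(output_ids[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()   # keep only the generated caption
```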

Modality Alignment

MLP projection from the vision backbone’s outputs into the LLM’s input embedding space.
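A minimal sketch of that alignment module in the LLaVA style, assuming a ViT-like vision backbone and a Hugging Face-style decoder LLM; dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # pretrained ViT (e.g. CLIP/SigLIP image tower)
        self.llm = llm                         # pretrained decoder-only LLM (HF-style interface assumed)
        # Two-layer MLP projector maps patch embeddings into the LLM's token-embedding space
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, input_ids):
        patch_embs = self.vision_encoder(pixel_values)            # (B, num_patches, vision_dim)
        image_tokens = self.projector(patch_embs)                 # (B, num_patches, llm_dim)
        text_embs = self.llm.get_input_embeddings()(input_ids)    # (B, seq_len, llm_dim)
        # Prepend image tokens; the LLM then does ordinary next-token prediction over the sequence
        inputs_embeds = torch.cat([image_tokens, text_embs], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

In LLaVA’s two-stage recipe only the projector is trained in the first (alignment) stage, with the vision encoder and LLM frozen; instruction tuning then also updates the LLM.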

Image Encoding

Multiple Resolutions, Scales and Aspect Ratios

End of Row Encoding

Positional Encodings

Token Compression / Dropping

Early vs Late Fusion

Datasets

Leaderboard / Benchmarks

Links