michal.i/o

VLMs for better Vision Backbones

Jan 21, 2025 · 1 min read

Goal

Train a better foundational image backbone. SSL and CLIP-style models don't get sufficient supervision for all downstream tasks.

Idea

Use a VLM with a tiny LLM as the decoder to pose all vision-related tasks as a single multi-task learning problem, with support for multi-image input. A T5-like approach: every task becomes text output conditioned on images plus a task prompt.

small LLM + large vision backbone
multi task across all vision tasks
GitHub - TIGER-AI-Lab/VLM2Vec: This repo contains the code and data for “VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks”
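
A minimal sketch of the data side of this idea, assuming each task can be expressed as "images + text prompt → text target" in the T5 spirit. The task names, prompt templates, and the `<image_i>` placeholder tokens are hypothetical and only illustrate the shape of the data; the model itself (large vision backbone feeding a small LLM decoder) is not shown.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    images: Tuple[str, ...]  # one or more image paths (multi-image input)
    prompt: str              # task instruction given to the small LLM decoder
    target: str              # expected text output


# Hypothetical prompt templates, one per vision task.
TEMPLATES: Dict[str, str] = {
    "classify": "classify the image:",
    "caption": "describe the image:",
    "compare": "which image shows {query}?",
}


def make_example(task: str, images: Sequence[str], target: str, **fields: str) -> Example:
    """Cast a task instance into the shared images + prompt -> text format."""
    template = TEMPLATES[task]  # raises KeyError for an unknown task
    # One placeholder token per image, so the decoder can refer to each input.
    placeholders = " ".join(f"<image_{i}>" for i in range(len(images)))
    prompt = f"{placeholders} {template.format(**fields)}"
    return Example(tuple(images), prompt, str(target))


def sample_batch(pools: Dict[str, List[Example]], batch_size: int,
                 rng: random.Random) -> List[Example]:
    """Draw a mixed batch: pick a task uniformly, then an example from its pool."""
    tasks = sorted(pools)
    return [rng.choice(pools[rng.choice(tasks)]) for _ in range(batch_size)]
```

Uniform task sampling is just the simplest choice here; weighting tasks by dataset size or difficulty would be a natural variation.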
