Goal
Train a better foundational image backbone; SSL and CLIP-style models don't provide sufficient supervision for all downstream tasks.
Idea
Use a VLM with a tiny LLM as a decoder to pose all vision-related tasks as one multi-task learning problem, with support for multi-image input (a T5-like text-to-text approach; sketched below).
- small LLM + large vision backbone
- multi-task across all vision tasks
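A minimal sketch of the setup under these assumptions (not a definitive implementation): a large ViT-style vision backbone produces patch tokens for one or more images, and a small autoregressive LLM decoder cross-attends to those tokens and emits text, so captioning, VQA, detection, etc. all share the same text-generation loss. All module names, sizes, and the prompt format below are illustrative.

```python
import torch
import torch.nn as nn

class VisionBackbone(nn.Module):
    """Large ViT-style encoder: images -> patch tokens (illustrative sizes)."""
    def __init__(self, dim=1024, depth=12, patch=16, img_size=224):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                    # (B, 3, H, W)
        x = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(x + self.pos)

class TinyLLMDecoder(nn.Module):
    """Small autoregressive decoder that cross-attends to visual tokens."""
    def __init__(self, vocab=32000, dim=512, depth=4, vis_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.project = nn.Linear(vis_dim, dim)    # match backbone width
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, token_ids, visual_tokens):
        tgt = self.embed(token_ids)               # (B, T, dim)
        mem = self.project(visual_tokens)         # (B, N, dim)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.decoder(tgt, mem, tgt_mask=causal)
        return self.lm_head(h)                    # (B, T, vocab)

# Multi-image, multi-task usage: concatenate patch tokens from several images
# and prepend a task prompt ("caption:", "vqa:", "detect:", ...) so every
# task reduces to next-token prediction over the same vocabulary.
backbone, decoder = VisionBackbone(), TinyLLMDecoder()
imgs = torch.randn(2, 2, 3, 224, 224)            # (B, num_images, 3, H, W)
vis = torch.cat([backbone(imgs[:, i]) for i in range(imgs.size(1))], dim=1)
tokens = torch.randint(0, 32000, (2, 10))        # tokenized "task: ... answer"
logits = decoder(tokens, vis)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 32000), tokens[:, 1:].reshape(-1))
```

The single cross-entropy objective is what makes the multi-task framing work: supervision for every vision task flows back into the shared backbone through the same loss, T5-style.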