PaliGemma

#vlm

PaliGemma 2

TLDR

  1. Take a pretrained SigLIP vision encoder and a pretrained Gemma LLM; add a linear projection that maps SigLIP output tokens into Gemma's input embedding space
  2. Train the whole thing end to end at 224x224 resolution (1 billion examples)
  3. Continue training at higher resolutions (50 million examples at 448x448, then 10 million at 896x896)
    1. sample more of the tasks that benefit from higher resolution, such as OCR
  4. Fine-tune for downstream tasks
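
Rough sketch of step 1 and of why the higher-resolution stages cost more. This is a toy in pure Python, not the actual implementation: the patch size of 14, the tiny widths `D_VISION` and `D_LLM`, and the helper names are all assumptions made for illustration.

```python
import random


def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Patch tokens for a square image (patch_size=14 is an assumption)."""
    side = resolution // patch_size
    return side * side


def linear_project(tokens, weight, bias):
    """Map each vision token (length D_VISION) to the LLM width (length D_LLM)."""
    return [
        [sum(t * w for t, w in zip(token, row)) + b for row, b in zip(weight, bias)]
        for token in tokens
    ]


# Toy widths, chosen only so the example runs quickly (hypothetical values).
D_VISION, D_LLM = 4, 6
rng = random.Random(0)

# Projection parameters: D_LLM rows, each of length D_VISION.
weight = [[rng.gauss(0, 0.02) for _ in range(D_VISION)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

# Stand-ins for the vision encoder output and the embedded text prompt.
image_tokens = [
    [rng.gauss(0, 1) for _ in range(D_VISION)]
    for _ in range(num_image_tokens(224))
]
text_embeddings = [[0.0] * D_LLM for _ in range(3)]

# Projected image tokens are placed in front of the text embeddings.
projected = linear_project(image_tokens, weight, bias)
llm_input = projected + text_embeddings

for res in (224, 448, 896):
    print(res, num_image_tokens(res))
```

Under the patch-size assumption, each doubling of resolution gives 4x the image tokens, which is one plausible reason the higher-resolution stages use far fewer examples than the 224x224 stage.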

