Vision-Language Pre-Training
https://github.com/salesforce/lavis
https://github.com/uta-smile/TCL
https://github.com/YehLi/xmodaler
https://github.com/sangminwoo/awesome-vision-and-language
Masked Vision and Language Modeling for Multi-modal Representation Learning (2022-08-03)
GitHub - guilk/VLC: Research code for “Training Vision-Language Transformers from Captions Alone”
GitHub - RERV/UniAdapter
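The masked-modeling entries above (VLC and "Masked Vision and Language Modeling") share one pretraining recipe: corrupt a random subset of text tokens and image patches, then reconstruct them from the joint image-text context. A toy sketch of the masking step, with an illustrative mask rate and token ids not taken from either repo:

```python
import torch

def random_mask(tokens: torch.Tensor, mask_id: int, p: float = 0.15):
    """Replace roughly a fraction p of ids with mask_id; return the
    corrupted sequence and the boolean mask of corrupted positions."""
    mask = torch.rand(tokens.shape, device=tokens.device) < p
    corrupted = tokens.clone()
    corrupted[mask] = mask_id
    return corrupted, mask

# Text tokens and (quantized) image patch tokens get the same treatment;
# the transformer is then trained to predict the originals at the masked positions.
text_ids = torch.randint(0, 30000, (2, 16))        # (batch, seq_len)
masked_ids, positions = random_mask(text_ids, mask_id=103)
```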
Contrastive
CLIP
- GitHub - baaivision/EVA: EVA Series: Vision Foundation Model Fanatics from BAAI
- e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce
- [2311.17049] MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
- [2307.16634] CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification
- [2309.05551] OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data
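The CLIP-family links above all train with the same symmetric contrastive objective: embed a batch of N images and N captions, then pull matching pairs together against the in-batch negatives. A minimal PyTorch sketch of that loss (names and shapes are my assumptions, following the CLIP paper rather than any specific repo above):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    # L2-normalize both embedding batches: shape (N, D)
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities scaled by a learned temperature: (N, N)
    logits = logit_scale.exp() * image_emb @ text_emb.t()

    # Matching image-text pairs sit on the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric cross-entropy over image->text and text->image directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Smoke test with random embeddings for a batch of 8 pairs
imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
scale = torch.tensor(2.659)  # ln(1/0.07), CLIP's temperature init
print(clip_contrastive_loss(imgs, txts, scale))
```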
Vision-Language Models
- [2306.07915v3] Image Captioners Are Scalable Vision Learners Too
- [2311.03079] CogVLM: Visual Expert for Pretrained Language Models
- GitHub - OpenBMB/MiniCPM-V: MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
- GitHub - FreedomIntelligence/LongLLaVA: LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
[Stanford CS25: V4 I From Large Language Models to Large Multimodal Models - YouTube](https://www.youtube.com/watch?v=cYfKQ6YG9Qo)
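For quick experiments with the captioner/VLM line of work above, a pretrained captioner can be queried in a few lines. A hedged example using BLIP via Hugging Face transformers as a stand-in (checkpoint name and image URL are illustrative; the repos listed above ship their own loaders):

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a public captioning checkpoint (stand-in choice, not from the list above)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Any RGB image works; this COCO image is a common demo input
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```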