Vision-Language Pre-Training
https://github.com/salesforce/lavis
https://github.com/uta-smile/TCL
https://github.com/YehLi/xmodaler
https://github.com/sangminwoo/awesome-vision-and-language
Masked Vision and Language Modeling for Multi-modal Representation Learning (2022-08-03)
GitHub - guilk/VLC: Research code for “Training Vision-Language Transformers from Captions Alone”
GitHub - RERV/UniAdapter
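The masked-modeling entries above (VLC and "Masked Vision and Language Modeling") share one pretraining recipe: corrupt a random subset of text tokens and image patches, then reconstruct them from the joint image-text context. A toy sketch of the masking step, with an illustrative mask rate and token ids not taken from either repo:

```python
import torch

def random_mask(tokens: torch.Tensor, mask_id: int, p: float = 0.15):
    """Replace roughly a fraction p of ids with mask_id; return the
    corrupted sequence and the boolean mask of corrupted positions."""
    mask = torch.rand(tokens.shape, device=tokens.device) < p
    corrupted = tokens.clone()
    corrupted[mask] = mask_id
    return corrupted, mask

# Text tokens and (quantized) image patch tokens get the same treatment;
# the transformer is then trained to predict the originals at the masked positions.
text_ids = torch.randint(0, 30000, (2, 16))        # (batch, seq_len)
masked_ids, positions = random_mask(text_ids, mask_id=103)
```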
Contrastive
CLIP
- GitHub - baaivision/EVA: EVA Series: Vision Foundation Model Fanatics from BAAI
- e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce
- [2311.17049] MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
- [2307.16634] CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification
- [2309.05551] OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data
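The CLIP-family links above all train with the same symmetric contrastive objective: embed a batch of N images and N captions, then pull matching pairs together against the in-batch negatives. A minimal PyTorch sketch of that loss (names and shapes are my assumptions, following the CLIP paper rather than any specific repo above):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    # L2-normalize both embedding batches: shape (N, D)
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities scaled by a learned temperature: (N, N)
    logits = logit_scale.exp() * image_emb @ text_emb.t()

    # Matching image-text pairs sit on the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric cross-entropy over image->text and text->image directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Smoke test with random embeddings for a batch of 8 pairs
imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
scale = torch.tensor(2.659)  # ln(1/0.07), CLIP's temperature init
print(clip_contrastive_loss(imgs, txts, scale))
```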
Vision-Language Models
- [2306.07915v3] Image Captioners Are Scalable Vision Learners Too
- [2311.03079] CogVLM: Visual Expert for Pretrained Language Models
- GitHub - OpenBMB/MiniCPM-V: MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
- GitHub - FreedomIntelligence/LongLLaVA: LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
[Stanford CS25: V4 I From Large Language Models to Large Multimodal Models - YouTube](https://www.youtube.com/watch?v=cYfKQ6YG9Qo)
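For quick experiments with the captioner/VLM line of work above, a pretrained captioner can be queried in a few lines. A hedged example using BLIP via Hugging Face transformers as a stand-in (checkpoint name and image URL are illustrative; the repos listed above ship their own loaders):

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a public captioning checkpoint (stand-in choice, not from the list above)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Any RGB image works; this COCO image is a common demo input
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```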