Models
Papers
- [2410.14072v1] Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
- [2410.11190] Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
- [2410.15458] Allegro: Open the Black Box of Commercial-Level Video Generation Model
- [2406.15786] What Matters in Transformers? Not All Attention is Needed
- [2410.15732v1] ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts
  - fork a pretrained ViT (DINOv2) by replicating its FFN weights into multiple experts
  - route on the CLS token, so all tokens of an image go to the same expert (image-level routing)
  - load-balancing loss to keep routing balanced across experts
  - shared experts that are always active for “common knowledge”
  - top-1 expert routing
  - Es = shared expert
- [2410.16261v1] Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
- https://openreview.net/pdf?id=vI95kcLAoU
- [2410.16512v1] TIPS: Text-Image Pretraining with Spatial Awareness
- [2410.17243v1] Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
- [2410.17251v1] Altogether: Image Captioning via Re-aligning Alt-text
- [2410.18967] Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
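
A minimal sketch of the ViMoE notes above (replicated FFN experts, CLS-based image-level routing, top-1 selection, an always-on shared expert, a load-balancing loss), in plain Python. This is not the paper's implementation. The names `make_moe`, `moe_forward`, and `load_balance_loss` are hypothetical, the way the routed and shared outputs are summed is an assumption, and the loss uses the Switch-Transformer-style form N * sum_i(f_i * P_i) because the notes do not say which form the paper uses.

```python
import copy
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ffn(params, x):
    """Two-layer MLP with ReLU, standing in for a ViT FFN block."""
    W1, W2 = params
    hidden = [max(0.0, v) for v in matvec(W1, x)]
    return matvec(W2, hidden)


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def make_moe(pretrained_ffn, num_experts, dim, rng):
    """Copy the pretrained FFN into each routed expert and the shared expert."""
    return {
        "experts": [copy.deepcopy(pretrained_ffn) for _ in range(num_experts)],
        "shared": copy.deepcopy(pretrained_ffn),  # Es in the notes
        # Router is new and randomly initialised: one row of logit weights per expert.
        "router": [[rng.gauss(0, 0.02) for _ in range(dim)]
                   for _ in range(num_experts)],
    }


def moe_forward(moe, tokens):
    """tokens[0] is the CLS token. Returns (outputs, expert index, router probs)."""
    probs = softmax(matvec(moe["router"], tokens[0]))  # route on CLS only
    k = max(range(len(probs)), key=probs.__getitem__)  # top-1 expert
    outputs = []
    for x in tokens:  # every token of the image uses the same expert k
        routed = ffn(moe["experts"][k], x)
        shared = ffn(moe["shared"], x)  # always active
        # Assumption: gate-weighted routed output plus shared output.
        outputs.append([probs[k] * r + s for r, s in zip(routed, shared)])
    return outputs, k, probs


def load_balance_loss(assignments, probs_per_image, num_experts):
    """Switch-style auxiliary loss over a batch of images (assumed form)."""
    n = len(assignments)
    frac = [assignments.count(i) / n for i in range(num_experts)]
    mean_p = [sum(p[i] for p in probs_per_image) / n for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_p))
```

Because every expert starts as a copy of the same FFN, all experts compute the same function at initialisation, and they only diverge once training updates them separately.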
Code
- [ ]
Articles
- Reaching 1B context length with RAG
- How Speculative Decoding Boosts vLLM Performance by up to 2.8x | vLLM Blog
- Simplifying, stabilizing, and scaling continuous-time consistency models | OpenAI
- Building Vectorize, a distributed vector database, on Cloudflare’s Developer Platform
Videos
- [ ]
Other
- [ ]