Models
- voyage-code-3: more accurate code retrieval with lower dimensional, quantized embeddings – Voyage AI
- Introducing Moondream 0.5B: The World’s Smallest Vision-Language Model
Papers
- [2411.18363] ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
- [2411.18207v1] From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
- [2411.18674v1] Active Data Curation Effectively Distills Large-Scale Multimodal Models
- [2411.19865] Reverse Thinking Makes LLMs Stronger Reasoners
- [2411.18933v1] Efficient Track Anything
- [2411.16085] Cautious Optimizers: Improving Training with One Line of Code
- [2411.16828] CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions
- [2412.00714] Scaling New Frontiers: Insights into Large Recommendation Models
- [2411.16205] MH-MoE: Multi-Head Mixture-of-Experts
- [2412.01720] LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant
- [2412.00965v1] Token Cropr: Faster ViTs for Quite a Few Tasks
- [2412.01818] [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
- [2412.01822v1] VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
- [2411.19842v1] Scaling Transformers for Low-Bitrate High-Quality Speech Coding
- [2412.01562] Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
- [2410.23970] TrAct: Making First-layer Pre-Activations Trainable
- [2412.01824] X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
- [2412.01819] Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
- [2412.01940] Down with the Hierarchy: The ‘H’ in HNSW Stands for “Hubs”
- [2412.01951] Self-Improvement in Language Models: The Sharpening Mechanism
- [2210.13452] MetaFormer Baselines for Vision
- [2412.03555] PaliGemma 2: A Family of Versatile VLMs for Transfer
- Probabilistic weather forecasting with machine learning | Nature
- [2412.03561v1] FLAIR: VLM with Fine-grained Language-informed Image Representations
- [2411.16318] One Diffusion to Generate Them All
- [2412.04431] Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
- [2412.04468] NVILA: Efficient Frontier Visual Language Models
- One Step Diffusion via Shortcut Models | OpenReview
- InternVL/InternVL2_5_report.pdf at main · OpenGVLab/InternVL · GitHub
- [2412.04234v1] DEIM: DETR with Improved Matching for Fast Convergence
- [2412.04332] Liquid: Language Models are Scalable Multi-modal Generators
- [2412.03215] Beyond [cls]: Exploring the true potential of Masked Image Modeling representations
Code
- GitHub - IDEA-Research/ChatRex: Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
- GitHub - 343gltysprk/ovow
- GitHub - Tencent/HunyuanVideo: HunyuanVideo: A Systematic Framework For Large Video Generation Model Training
- GitHub - benbergner/cropr
- GitHub - Theia-4869/FasterVLM: Official code for paper: [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster.
- GitHub - fishaudio/fish-speech: SOTA Open Source TTS
Articles
- Diffusion Meets Flow Matching
- Optimizing ColPali for Retrieval at Scale, 13x Faster Results - Qdrant
- OpenAI’s o1 using “search” was a PSYOP - by Nathan Lambert
- GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy - Google DeepMind
Videos
- The Hitchhiker’s Guide to Reasoning - YouTube
- Aaron Defazio Talk (12.06.2024, UCLA) - YouTube
- NEURAL NETWORKS ARE REALLY WEIRD… - YouTube
Other
- GitHub - huggingface/smol-course: A course on aligning smol models.
- arsaporta/symile-m3 · Datasets at Hugging Face