Models
- voyage-code-3: more accurate code retrieval with lower dimensional, quantized embeddings – Voyage AI
- Introducing Moondream 0.5B: The World’s Smallest Vision-Language Model
Papers
- [2411.18363] ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
- [2411.18207v1] From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
- [2411.18674v1] Active Data Curation Effectively Distills Large-Scale Multimodal Models
- [2411.19865] Reverse Thinking Makes LLMs Stronger Reasoners
- [2411.18933v1] Efficient Track Anything
- [2411.16085] Cautious Optimizers: Improving Training with One Line of Code
- [2411.16828] CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions
- [2412.00714] Scaling New Frontiers: Insights into Large Recommendation Models
- [2411.16205] MH-MoE: Multi-Head Mixture-of-Experts
- [2412.01720] LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant
- [2412.00965v1] Token Cropr: Faster ViTs for Quite a Few Tasks
- [2412.01818] [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster
- [2412.01822v1] VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
- [2411.19842v1] Scaling Transformers for Low-Bitrate High-Quality Speech Coding
- [2412.01562] Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
- [2410.23970] TrAct: Making First-layer Pre-Activations Trainable
- [2412.01824] X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
- [2412.01819] Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
- [2412.01940] Down with the Hierarchy: The ‘H’ in HNSW Stands for “Hubs”
- [2412.01951] Self-Improvement in Language Models: The Sharpening Mechanism
- [2210.13452] MetaFormer Baselines for Vision
- [2412.03555] PaliGemma 2: A Family of Versatile VLMs for Transfer
- Probabilistic weather forecasting with machine learning | Nature
- [2412.03561v1] FLAIR: VLM with Fine-grained Language-informed Image Representations
- [2411.16318] One Diffusion to Generate Them All
- [2412.04431] Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
- [2412.04468] NVILA: Efficient Frontier Visual Language Models
- One Step Diffusion via Shortcut Models | OpenReview
- InternVL/InternVL2_5_report.pdf at main · OpenGVLab/InternVL · GitHub
- [2412.04234v1] DEIM: DETR with Improved Matching for Fast Convergence
- [2412.04332] Liquid: Language Models are Scalable Multi-modal Generators
- [2412.03215] Beyond [cls]: Exploring the true potential of Masked Image Modeling representations
Code
- GitHub - IDEA-Research/ChatRex: Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
- GitHub - 343gltysprk/ovow
- GitHub - Tencent/HunyuanVideo: HunyuanVideo: A Systematic Framework For Large Video Generation Model Training
- GitHub - benbergner/cropr
- GitHub - Theia-4869/FasterVLM: Official code for paper: [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster.
- GitHub - fishaudio/fish-speech: SOTA Open Source TTS
Articles
- Diffusion Meets Flow Matching
- Optimizing ColPali for Retrieval at Scale, 13x Faster Results - Qdrant
- OpenAI’s o1 using “search” was a PSYOP - by Nathan Lambert
- GenCast predicts weather and the risks of extreme conditions with state-of-the-art accuracy - Google DeepMind
Videos
- The Hitchhiker’s Guide to Reasoning - YouTube
- Aaron Defazio Talk (12.06.2024, UCLA) - YouTube
- NEURAL NETWORKS ARE REALLY WEIRD… - YouTube
Other
- GitHub - huggingface/smol-course: A course on aligning smol models.
- arsaporta/symile-m3 · Datasets at Hugging Face