2025-01-27
Models
- YuE - Open Music Foundation Models for Full-Song Generation
- Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL! | Qwen
- deepseek-ai/Janus-Pro-7B · Hugging Face
- Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens | Qwen
Papers
- [2501.14458] A Survey of Optimization Methods for Training DL Models: Theoretical Perspective on Convergence and Generalization
- [2501.14818] Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models
- [2501.15369] iFormer: Integrating ConvNet and Transformer for Mobile Application
- [2501.16975] Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
- [2501.12370] Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
- [2501.17703] Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
- [2501.17788] WARP: An Efficient Engine for Multi-Vector Retrieval
- [2411.15124] Tulu 3: Pushing Frontiers in Open Language Model Post-Training
- [2501.17161v1] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- [2501.17811] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
- [2501.16664] Improving Vision-Language-Action Model with Online Reinforcement Learning
- [2501.17391] Learning Free Token Reduction for Multi-Modal LLM
- Efficient Diffusion Models: A Survey | OpenReview
- [2501.11651] Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
- [2501.18593] Diffusion Autoencoders are Scalable Image Tokenizers
- [2501.18427] SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
- [2501.16182] The Linear Attention Resurrection in Vision Transformer
- [2501.13106] VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
- [2412.15188] LMFusion: Adapting Pretrained Language Models for Multimodal Generation
- [2501.17486] DINT Transformer
- [2501.18596] DeltaLLM: Compress LLMs with Low-Rank Deltas between Shared Weights
- [2501.16273] Return of the Encoder: Maximizing Parameter Efficiency for SLMs
- [2501.15513] TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding
- [2501.15665] StagFormer: Time Staggering Transformer Decoding for Running Layers In Parallel
Code
- GitHub - vllm-project/production-stack
- GitHub - hkust-nlp/simpleRL-reason: A replication of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data
- GitHub - TIGER-AI-Lab/CritiqueFineTuning: Code for “Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate”
- GitHub - jlscheerer/xtr-warp: XTR/WARP is an extremely fast and accurate retrieval engine based on Stanford’s ColBERTv2/PLAID and Google DeepMind’s XTR.
- GitHub - Jiayi-Pan/TinyZero: Clean, accessible reproduction of DeepSeek R1-Zero
- GitHub - ZihanWang314/RAGEN: RAGEN is the first open-source reproduction of DeepSeek-R1 for training agentic models via reinforcement learning.
- GitHub - sail-sg/oat: 🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc.
- GitHub - microsoft/encoder-decoder-slm: Efficient encoder-decoder architecture for small language models (≤1B parameters) with cross-architecture knowledge distillation and vision-language capabilities
Articles
- vLLM V1: A Major Upgrade to vLLM’s Core Architecture | vLLM Blog
- Run DeepSeek-R1 Dynamic 1.58-bit
- Memory Snapshots: Checkpoint/Restore for Sub-second Startup | Modal Blog
- Mini-R1: Reproduce Deepseek R1 “aha moment” — an RL tutorial
- NeurIPS 2024: Diffusion Themes and Memes – Joshua Bambrick’s Blog
Videos
Other