Models
- [ ]
Papers
- [2410.19456] Computational Bottlenecks of Training Small-scale Large Language Models
- [2410.19313] COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
- [2410.19324] Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion
- [2410.18111] Data Efficiency for Large Recommendation Models
- [2410.16048] Continuous Speech Synthesis using per-token Latent Diffusion
- [2410.18558] Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
- [2410.18613] Rethinking Softmax: Self-Attention with Polynomial Activations
- [2410.17980] Stick-breaking Attention
- Towards Learning to Reason at Pre-Training Scale | OpenReview
- HIL-SERL: Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning
- [2410.18779] A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
- [2410.20672] Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
- [2410.20280] MarDini: Masked Autoregressive Diffusion for Video Generation at Scale
- [2410.21465] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
- [2405.02793] ImageInWords: Unlocking Hyper-Detailed Image Descriptions
- [2406.11832] Unveiling Encoder-Free Vision-Language Models
- [2410.20399] ThunderKittens: Simple, Fast, and Adorable AI Kernels (GitHub - HazyResearch/ThunderKittens: Tile primitives for speedy kernels)
- GitHub - multimodal-interpretability/nnn: Nearest Neighbor Normalization (EMNLP 2024)
- pi0.pdf (Physical Intelligence: π0: A Vision-Language-Action Flow Model for General Robot Control)
- [2410.22179] Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech
- ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing | OpenReview
Code
- [ ]
Articles
Videos
- Guest Lecture by Kyle Lo: Demystifying data curation for pretrained language models - YouTube
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models - YouTube
- Cohere For AI - Community Talks: Randall Balestriero - YouTube
- Training Zamba: A Hybrid Model Master Class with Zyphra’s Quentin Anthony - YouTube
Other
- [ ]
Tweets
Notes
- Weight sharing for attention layers
- Compute stays the same; tie weights across layers to limit parameter count (see the first sketch below)
- SSM + Attention Hybrids
- Can drop most attention layers and use SSMs in their place
- Sliding-window attention + Mamba (from Samba; see the second sketch below)
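
A minimal sketch of the weight-sharing note (PyTorch; all class and parameter names are illustrative, not taken from any paper above): one transformer block is defined once and looped over depth, so per-token compute matches an L-layer model while parameters are stored for a single block, in the spirit of recursive/looped transformers.

```python
# Illustrative sketch: tied attention/MLP weights across depth.
# Compute per token is unchanged (the same block runs `depth` times),
# but parameters are stored only once.
import torch
import torch.nn as nn

class TiedBlock(nn.Module):
    """One transformer block whose weights are reused at every depth."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x

class RecursiveTransformer(nn.Module):
    """Applies the *same* block `depth` times: depth-L compute, 1-block parameters."""
    def __init__(self, d_model=256, n_heads=4, depth=8):
        super().__init__()
        self.block = TiedBlock(d_model, n_heads)   # stored once, shared across depth
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):                # looped over depth
            x = self.block(x)
        return x

model = RecursiveTransformer()
x = torch.randn(2, 16, 256)                        # (batch, seq, d_model)
print(model(x).shape, sum(p.numel() for p in model.parameters()))
```

The Relaxed Recursive Transformers paper in the list above additionally relaxes the tying with layer-wise LoRA; this sketch shows only the plain shared-block idea.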
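
A minimal sketch of the SSM + attention hybrid note, again with illustrative names only: most layers are SSM-style blocks (stubbed here with a gated causal depthwise convolution, not a real Mamba implementation) and every fourth layer is sliding-window attention, mirroring the Samba-style layout.

```python
# Illustrative sketch: Samba-style hybrid stack.
# SSMBlock is a stand-in for a selective SSM / Mamba layer; swap in a real
# implementation for actual use.
import torch
import torch.nn as nn

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask that also blocks attention beyond `window` past tokens."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    allowed = (j <= i) & (j > i - window)
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[allowed] = 0.0
    return mask

class SWABlock(nn.Module):
    """Sliding-window attention layer."""
    def __init__(self, d_model, n_heads, window):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(d_model)
        self.window = window

    def forward(self, x):
        h = self.ln(x)
        mask = sliding_window_mask(x.size(1), self.window).to(x.device)
        out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        return x + out

class SSMBlock(nn.Module):
    """Stand-in for an SSM/Mamba layer: gated causal depthwise convolution."""
    def __init__(self, d_model, kernel=4):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel, padding=kernel - 1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):
        h = self.ln(x)
        c = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # causal trim
        return x + c * torch.sigmoid(self.gate(h))

# Mostly SSM layers, with sliding-window attention every fourth layer.
d_model, n_heads, window = 256, 4, 8
layers = nn.Sequential(*[
    SWABlock(d_model, n_heads, window) if (i + 1) % 4 == 0 else SSMBlock(d_model)
    for i in range(8)
])
x = torch.randn(2, 32, d_model)
print(layers(x).shape)  # torch.Size([2, 32, 256])
```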