Models
- [ ]
Papers
- [2410.19456] Computational Bottlenecks of Training Small-scale Large Language Models
- [2410.19313] COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
- [2410.19324] Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion
- [2410.18111] Data Efficiency for Large Recommendation Models
- [2410.16048] Continuous Speech Synthesis using per-token Latent Diffusion
- [2410.18558] Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
- [2410.18613] Rethinking Softmax: Self-Attention with Polynomial Activations
- [2410.17980] Stick-breaking Attention
- Towards Learning to Reason at Pre-Training Scale | OpenReview
- HIL-SERL: Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning
- [2410.18779] A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
- [2410.20672] Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
- [2410.20280] MarDini: Masked Autoregressive Diffusion for Video Generation at Scale
- [2410.21465] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
- [2405.02793] ImageInWords: Unlocking Hyper-Detailed Image Descriptions
- [2406.11832] Unveiling Encoder-Free Vision-Language Models
- [2410.20399] ThunderKittens: Simple, Fast, and Adorable AI Kernels (GitHub - HazyResearch/ThunderKittens: Tile primitives for speedy kernels)
- GitHub - multimodal-interpretability/nnn: Nearest Neighbor Normalization (EMNLP 2024)
- pi0.pdf (Physical Intelligence: π0: A Vision-Language-Action Flow Model for General Robot Control)
- [2410.22179] Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech
- ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing | OpenReview
Code
- [ ]
Articles
Videos
- Guest Lecture by Kyle Lo: Demystifying data curation for pretrained language models - YouTube
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models - YouTube
- Cohere For AI - Community Talks: Randall Balestriero - YouTube
- Training Zamba: A Hybrid Model Master Class with Zyphra’s Quentin Anthony - YouTube
Other
- [ ]
Tweets
Notes
- Weight sharing for attention layers
- Compute stays the same; tie weights across layers to limit parameter count (see the first sketch below)
- SSM + Attention Hybrids
- Can drop most attention layers and use SSMs in their place
- Sliding-window attention + Mamba (from Samba; see the second sketch below)
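
A minimal sketch of the weight-sharing note (PyTorch; all class and parameter names are illustrative, not taken from any paper above): one transformer block is defined once and looped over depth, so per-token compute matches an L-layer model while parameters are stored for a single block, in the spirit of recursive/looped transformers.

```python
# Illustrative sketch: tied attention/MLP weights across depth.
# Compute per token is unchanged (the same block runs `depth` times),
# but parameters are stored only once.
import torch
import torch.nn as nn

class TiedBlock(nn.Module):
    """One transformer block whose weights are reused at every depth."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x

class RecursiveTransformer(nn.Module):
    """Applies the *same* block `depth` times: depth-L compute, 1-block parameters."""
    def __init__(self, d_model=256, n_heads=4, depth=8):
        super().__init__()
        self.block = TiedBlock(d_model, n_heads)   # stored once, shared across depth
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):                # looped over depth
            x = self.block(x)
        return x

model = RecursiveTransformer()
x = torch.randn(2, 16, 256)                        # (batch, seq, d_model)
print(model(x).shape, sum(p.numel() for p in model.parameters()))
```

The Relaxed Recursive Transformers paper in the list above additionally relaxes the tying with layer-wise LoRA; this sketch shows only the plain shared-block idea.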
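
A minimal sketch of the SSM + attention hybrid note, again with illustrative names only: most layers are SSM-style blocks (stubbed here with a gated causal depthwise convolution, not a real Mamba implementation) and every fourth layer is sliding-window attention, mirroring the Samba-style layout.

```python
# Illustrative sketch: Samba-style hybrid stack.
# SSMBlock is a stand-in for a selective SSM / Mamba layer; swap in a real
# implementation for actual use.
import torch
import torch.nn as nn

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask that also blocks attention beyond `window` past tokens."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    allowed = (j <= i) & (j > i - window)
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[allowed] = 0.0
    return mask

class SWABlock(nn.Module):
    """Sliding-window attention layer."""
    def __init__(self, d_model, n_heads, window):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(d_model)
        self.window = window

    def forward(self, x):
        h = self.ln(x)
        mask = sliding_window_mask(x.size(1), self.window).to(x.device)
        out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        return x + out

class SSMBlock(nn.Module):
    """Stand-in for an SSM/Mamba layer: gated causal depthwise convolution."""
    def __init__(self, d_model, kernel=4):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel, padding=kernel - 1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):
        h = self.ln(x)
        c = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # causal trim
        return x + c * torch.sigmoid(self.gate(h))

# Mostly SSM layers, with sliding-window attention every fourth layer.
d_model, n_heads, window = 256, 4, 8
layers = nn.Sequential(*[
    SWABlock(d_model, n_heads, window) if (i + 1) % 4 == 0 else SSMBlock(d_model)
    for i in range(8)
])
x = torch.randn(2, 32, d_model)
print(layers(x).shape)  # torch.Size([2, 32, 256])
```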