Models
- Llama 3.2: Revolutionizing edge AI and vision with open, customizable models #vlm
- molmo.allenai.org/blog — Molmo multimodal models #vlm
- stepfun-ai/GOT-OCR2_0 · Hugging Face #ocr
- GitHub - ByungKwanLee/Phantom: [Under Review] Official PyTorch implementation of Phantom of Latent, which enlarges the latent hidden dimension to build frontier vision-language models #vlm
Papers
- [2409.13523] EMMeTT: Efficient Multimodal Machine Translation Training
- [2409.14683] Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
- [2409.15278] PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Code
- RWKV-LM/RWKV-v7 at main · BlinkDL/RWKV-LM · GitHub
- GitHub - willccbb/mlx_parallm: Fast parallel LLM inference for MLX #mlx #inference
Articles
- The Practitioner’s Guide to the Maximal Update Parameterization | EleutherAI Blog #scaling
- Understanding how LLM inference works with llama.cpp #llama.cpp
- Techniques for KV Cache Optimization in Large Language Models #kv-cache #llm-inference
- The basic idea behind FlashAttention #flash-attention #softmax
- FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention | PyTorch
- Tune Llama3 405B on AMD MI300x (our journey) - Felafax Blog - Obsidian Publish #amd #jax
- Exploring Parallel Strategies with Jax | AstraBlog #distributed #jax
- Power of Diffusion Models | AstraBlog #diffusion
- GenAI Handbook
Videos
- Boris Hanin | Scaling Limits of Neural Networks - YouTube
- MLBBQ: Flash Attention by Mike Doan - YouTube #flash-attention