Models
- Llama 3.2: Revolutionizing edge AI and vision with open, customizable models #vlm
- molmo.allenai.org/blog — Molmo multimodal models #vlm
- stepfun-ai/GOT-OCR2_0 · Hugging Face #ocr
- GitHub - ByungKwanLee/Phantom: [Under Review] Official PyTorch implementation of Phantom of Latent, which enlarges the latent hidden dimension to build frontier vision-language models #vlm
Papers
- [2409.13523] EMMeTT: Efficient Multimodal Machine Translation Training
- [2409.14683] Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
- [2409.15278] PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Code
- RWKV-LM/RWKV-v7 at main · BlinkDL/RWKV-LM · GitHub
- GitHub - willccbb/mlx_parallm: Fast parallel LLM inference for MLX #mlx #inference
Articles
- The Practitioner’s Guide to the Maximal Update Parameterization | EleutherAI Blog #scaling
- Understanding how LLM inference works with llama.cpp #llama.cpp
- Techniques for KV Cache Optimization in Large Language Models #kv-cache #llm-inference
- The basic idea behind FlashAttention #flash-attention #softmax
- FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention | PyTorch
- Tune Llama3 405B on AMD MI300x (our journey) - Felafax Blog - Obsidian Publish #amd #jax
- Exploring Parallel Strategies with Jax | AstraBlog #distributed #jax
- Power of Diffusion Models | AstraBlog #diffusion
- GenAI Handbook
Videos
- Boris Hanin | Scaling Limits of Neural Networks - YouTube
- MLBBQ: Flash Attention by Mike Doan - YouTube #flash-attention