- [2306.11197] Sparse Modular Activation for Efficient Sequence Modeling
- GitHub - renll/SeqBoat: [NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling
Architectures
RWKV
Monarch Mixer
Making Foundation Models More Efficient - Dan Fu | Stanford MLSys #86 - YouTube
Mamba
Mamba2
Codestral Mamba (Mistral)
H3
Zamba
Zamba2
Jamba
Samba
minGRU and minLSTM
xLSTM
Hawk and Griffin
DeltaNet
Gated DeltaNets
- [2412.06464] Gated Delta Networks: Improving Mamba2 with Delta Rule
- GitHub - NVlabs/GatedDeltaNet: Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
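A minimal sketch of the (gated) delta rule named in the paper above, assuming the recurrence S_t = alpha_t * S_{t-1}(I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T with readout o_t = S_t q_t; setting alpha_t = 1 recovers plain DeltaNet. This is a naive sequential reference, not the chunked/parallel kernels of the official repo, and the tensor names and shapes are illustrative only.

```python
# Hedged sketch (assumption, not the official NVlabs implementation): a naive
# sequential reference of the gated delta rule, assuming the recurrence
#   S_t = alpha_t * S_{t-1} @ (I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T
# with readout o_t = S_t @ q_t. Setting alpha_t = 1 gives plain DeltaNet.
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates / learning rates in (0, 1)."""
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_v, d_k)              # fast-weight / associative-memory state
    eye = torch.eye(d_k)
    outs = []
    for t in range(T):
        # gated decay plus delta-rule "erase" of the old value stored along k_t
        S = alpha[t] * S @ (eye - beta[t] * torch.outer(k[t], k[t]))
        # write the new key-value association
        S = S + beta[t] * torch.outer(v[t], k[t])
        outs.append(S @ q[t])
    return torch.stack(outs)               # (T, d_v)

# usage sketch:
# T, d = 16, 8
# o = gated_delta_rule(torch.randn(T, d), torch.randn(T, d), torch.randn(T, d),
#                      torch.rand(T), torch.rand(T))
```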
Hymba
- [2411.13676] Hymba: A Hybrid-head Architecture for Small Language Models
- nvidia/Hymba-1.5B-Base · Hugging Face
- claims running attention heads and SSM heads in parallel within a layer works better than stacking them as separate layers (see the sketch after this list)
- adds 128 learnable meta tokens (register-style tokens prepended to the input)
- uses KV cache sharing across layers
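A minimal sketch of the parallel-heads idea from the bullet above: one block runs an attention branch and a recurrent/SSM-style branch on the same input side by side and fuses their outputs, instead of stacking them as separate layers. The module names, the GRU stand-in for the SSM branch, and the concat-then-project fusion are assumptions for illustration, not Hymba's actual design (which also adds meta/register tokens and cross-layer KV sharing).

```python
# Hedged sketch of the parallel-heads idea: attention and an SSM-style branch
# run side by side on the same input and are fused, rather than stacked as
# separate layers. The GRU is only a stand-in for a selective SSM, and the
# concat-then-project fusion is an assumption for illustration; a real causal
# LM block would also apply a causal attention mask.
import torch
import torch.nn as nn

class ParallelHybridBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ssm_proxy = nn.GRU(d_model, d_model, batch_first=True)  # placeholder SSM branch
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                                # x: (batch, seq, d_model)
        h = self.norm(x)
        a, _ = self.attn(h, h, h, need_weights=False)    # attention branch
        s, _ = self.ssm_proxy(h)                         # recurrent/SSM branch
        return x + self.fuse(torch.cat([a, s], dim=-1))  # fuse branches, add residual

# usage sketch:
# y = ParallelHybridBlock(256)(torch.randn(2, 128, 256))   # -> (2, 128, 256)
```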