- [2306.11197] Sparse Modular Activation for Efficient Sequence Modeling
- GitHub - renll/SeqBoat: [NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling
Architectures
RWKV
Monarch Mixer
Making Foundation Models More Efficient - Dan Fu | Stanford MLSys #86 - YouTube
Mamba
Mamba2
Codestral Mamba (Mistral)
H3
Zamba
Zamba2
Jamba
Samba
minGRU and minLSTM
xLSTM
Hawk and Griffin
DeltaNet
Gated DeltaNets
- [2412.06464] Gated Delta Networks: Improving Mamba2 with Delta Rule
- GitHub - NVlabs/GatedDeltaNet: Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
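A minimal sketch of the (gated) delta rule named in the paper above, assuming the recurrence S_t = alpha_t * S_{t-1}(I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T with readout o_t = S_t q_t; setting alpha_t = 1 recovers plain DeltaNet. This is a naive sequential reference, not the chunked/parallel kernels of the official repo, and the tensor names and shapes are illustrative only.

```python
# Hedged sketch (assumption, not the official NVlabs implementation): a naive
# sequential reference of the gated delta rule, assuming the recurrence
#   S_t = alpha_t * S_{t-1} @ (I - beta_t * k_t k_t^T) + beta_t * v_t k_t^T
# with readout o_t = S_t @ q_t. Setting alpha_t = 1 gives plain DeltaNet.
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates / learning rates in (0, 1)."""
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_v, d_k)              # fast-weight / associative-memory state
    eye = torch.eye(d_k)
    outs = []
    for t in range(T):
        # gated decay plus delta-rule "erase" of the old value stored along k_t
        S = alpha[t] * S @ (eye - beta[t] * torch.outer(k[t], k[t]))
        # write the new key-value association
        S = S + beta[t] * torch.outer(v[t], k[t])
        outs.append(S @ q[t])
    return torch.stack(outs)               # (T, d_v)

# usage sketch:
# T, d = 16, 8
# o = gated_delta_rule(torch.randn(T, d), torch.randn(T, d), torch.randn(T, d),
#                      torch.rand(T), torch.rand(T))
```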
Hymba
- [2411.13676] Hymba: A Hybrid-head Architecture for Small Language Models
- nvidia/Hymba-1.5B-Base · Hugging Face
- claims running attention heads and SSM heads in parallel within a layer works better than stacking them as separate layers (see the sketch after this list)
- adds 128 learnable meta tokens (register-style tokens prepended to the input)
- uses KV cache sharing across layers
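A minimal sketch of the parallel-heads idea from the bullet above: one block runs an attention branch and a recurrent/SSM-style branch on the same input side by side and fuses their outputs, instead of stacking them as separate layers. The module names, the GRU stand-in for the SSM branch, and the concat-then-project fusion are assumptions for illustration, not Hymba's actual design (which also adds meta/register tokens and cross-layer KV sharing).

```python
# Hedged sketch of the parallel-heads idea: attention and an SSM-style branch
# run side by side on the same input and are fused, rather than stacked as
# separate layers. The GRU is only a stand-in for a selective SSM, and the
# concat-then-project fusion is an assumption for illustration; a real causal
# LM block would also apply a causal attention mask.
import torch
import torch.nn as nn

class ParallelHybridBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ssm_proxy = nn.GRU(d_model, d_model, batch_first=True)  # placeholder SSM branch
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                                # x: (batch, seq, d_model)
        h = self.norm(x)
        a, _ = self.attn(h, h, h, need_weights=False)    # attention branch
        s, _ = self.ssm_proxy(h)                         # recurrent/SSM branch
        return x + self.fuse(torch.cat([a, s], dim=-1))  # fuse branches, add residual

# usage sketch:
# y = ParallelHybridBlock(256)(torch.randn(2, 128, 256))   # -> (2, 128, 256)
```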