Architectures

RWKV

Monarch Mixer

Making Foundation Models More Efficient - Dan Fu | Stanford MLSys #86 - YouTube

Mamba

Mamba2

Codestral Mamba (Mistral)

H3

Zamba

Zamba2

Jamba

Samba

minGRU and minLSTM

xLSTM

Hawk and Griffin

DeltaNet

Gated DeltaNet

Hymba

  • claims running attention and SSM heads in parallel within a layer works better than stacking them sequentially
  • prepends 128 learnable meta (register) tokens to the sequence
  • shares the KV cache across layers
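A minimal sketch of the parallel-heads idea, assuming simplified shapes and fusion (tied Q=K=V attention, a toy decay recurrence standing in for the SSM, random stand-ins for the learnable meta tokens; none of this is the paper's actual code): both branches see the same input and their normalized outputs are averaged, rather than one block feeding the next.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn_head(x):
    # single self-attention head with tied Q = K = V = x for brevity
    scores = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    return scores @ x

def ssm_head(x, decay=0.9):
    # toy linear recurrence h_t = decay * h_{t-1} + x_t (stand-in for an SSM scan)
    h = np.zeros_like(x[0])
    out = []
    for x_t in x:
        h = decay * h + x_t
        out.append(h.copy())
    return np.stack(out)

def hybrid_block(x, n_meta=4):
    # prepend "meta"/register tokens; learnable in the real model, random here
    meta = np.random.default_rng(0).normal(size=(n_meta, x.shape[-1]))
    xm = np.concatenate([meta, x], axis=0)
    a, s = attn_head(xm), ssm_head(xm)
    # fuse the parallel heads: normalize each branch, then average
    fuse = lambda y: y / (np.linalg.norm(y, axis=-1, keepdims=True) + 1e-6)
    return (fuse(a) + fuse(s))[n_meta:] / 2  # drop meta tokens from the output

x = np.random.default_rng(1).normal(size=(6, 8))  # (seq_len, d_model)
y = hybrid_block(x)
print(y.shape)
```

The normalize-then-average fusion is one plausible choice; the point is only that the two heads are siblings operating on the same tokens, not stacked layers.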

Attamba

Bamba

Patterns

Hybrids

Pruning

Simba (Hierarchical SSM Pruning)