- transformers: quadratic in input length (self-attention scores every pair of tokens; see the attention sketch after this list)
- model interactions within a fixed context window well, but can't model anything beyond the supported window
- mamba
  - linear in sequence length (see the recurrence sketch after this list)
  - 5x higher inference throughput than transformers
  - mamba-3B outperforms transformers of the same size and matches transformers twice its size
- Structured State Space Sequence Models (SSMs) (S4)
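A minimal sketch of why attention cost is quadratic (illustrative NumPy only, not any specific library's implementation): single-head attention materializes an L x L score matrix, so both compute and memory grow with the square of the sequence length.

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention.

    q, k, v: (L, d) arrays. The score matrix is (L, L), so compute and
    memory grow quadratically with the sequence length L.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (L, L) -- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v                                 # (L, d)

L, d = 1024, 64
x = np.random.randn(L, d)
out = attention(x, x, x)   # materializes a 1024 x 1024 score matrix
```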
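And a minimal sketch of the linear-time alternative: a toy discretized state space recurrence in the spirit of S4/Mamba (Mamba's actual selective, input-dependent, hardware-aware scan is more involved). Each token costs one fixed-size state update, so processing the sequence is O(L), and generation only needs the small state h rather than a growing attention cache.

```python
import numpy as np

def ssm_scan(x, A_bar, B_bar, C):
    """Toy discretized SSM recurrence over a scalar input sequence x:
        h_t = A_bar @ h_{t-1} + B_bar * x_t
        y_t = C @ h_t
    One constant-time state update per token => linear in sequence length.
    """
    N = A_bar.shape[0]
    h = np.zeros(N)            # fixed-size hidden state
    ys = []
    for x_t in x:              # single pass over the sequence
        h = A_bar @ h + B_bar * x_t   # constant-cost state update
        ys.append(C @ h)              # readout
    return np.array(ys)

N, L = 16, 1024
A_bar = np.diag(np.exp(-np.linspace(0.1, 1.0, N)))  # stable toy dynamics
B_bar = np.ones(N)
C = np.random.randn(N)
y = ssm_scan(np.random.randn(L), A_bar, B_bar, C)    # shape (L,)
```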
TODO
- continue