Models

Papers

Code

Articles

Videos

Other

Tweets

Notes

  • Weight sharing for attention layers
    • Per-layer compute is unchanged; tying the attention weights across layers caps the parameter count (see the first sketch below)
  • SSM + Attention Hybrids
    • Most attention layers can be dropped and replaced with SSM layers
    • Samba interleaves Mamba blocks with sliding-window attention (see the second sketch below)
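
A minimal PyTorch sketch of the weight-sharing idea, assuming pre-norm residual blocks; `SharedAttentionStack` and its layout are hypothetical, not from any specific paper. Every layer calls the same `nn.MultiheadAttention` module, so per-layer FLOPs match an untied stack while the attention parameters stay at one layer's worth (only the LayerNorms are per-layer):

```python
import torch
import torch.nn as nn

class SharedAttentionStack(nn.Module):
    """Hypothetical stack whose layers all reuse one attention module."""

    def __init__(self, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        # One attention module referenced by every layer: weight tying.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Norms stay per-layer; only the attention projections are shared.
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for norm in self.norms:  # same FLOPs per layer as an untied stack
            h = norm(x)
            out, _ = self.attn(h, h, h, need_weights=False)
            x = x + out  # pre-norm residual connection
        return x

# Attention parameter count is that of a single layer regardless of depth:
stack = SharedAttentionStack(d_model=256, n_heads=8, n_layers=12)
print(sum(p.numel() for p in stack.parameters()))
y = stack(torch.randn(2, 16, 256))  # (batch, seq, d_model)
```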
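
A runnable sketch of the Samba-style layer pattern, interleaving an SSM block with sliding-window attention. `ToySSM` is a deliberately simplified diagonal linear recurrence standing in for Mamba's selective SSM (the real Samba uses Mamba blocks and a parallel scan); `SambaStyleBlockStack`, the window size, and the block layout are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # Bool mask for nn.MultiheadAttention: True = not allowed to attend.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    causal = j > i               # no attending to future positions
    too_far = (i - j) >= window  # no attending beyond the local window
    return causal | too_far

class ToySSM(nn.Module):
    """Toy stand-in for Mamba: per-channel recurrence h_t = a*h_{t-1} + b*x_t."""

    def __init__(self, d_model: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(d_model))
        self.b = nn.Parameter(torch.ones(d_model))
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        a = torch.sigmoid(self.log_a)  # keep the recurrence stable in (0, 1)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):  # sequential loop; Mamba uses a parallel scan
            h = a * h + self.b * x[:, t]
            outs.append(h)
        return self.out(torch.stack(outs, dim=1))

class SambaStyleBlockStack(nn.Module):
    """Interleaves SSM blocks with sliding-window attention, Samba-style."""

    def __init__(self, d_model: int, n_heads: int, n_pairs: int, window: int):
        super().__init__()
        self.window = window
        self.pairs = nn.ModuleList(
            nn.ModuleDict({
                "ssm_norm": nn.LayerNorm(d_model),
                "ssm": ToySSM(d_model),
                "attn_norm": nn.LayerNorm(d_model),
                "attn": nn.MultiheadAttention(d_model, n_heads, batch_first=True),
            })
            for _ in range(n_pairs)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = sliding_window_mask(x.size(1), self.window).to(x.device)
        for blk in self.pairs:
            # SSM carries long-range context through its recurrent state.
            x = x + blk["ssm"](blk["ssm_norm"](x))
            # Banded attention handles exact token lookup within the window.
            h = blk["attn_norm"](x)
            out, _ = blk["attn"](h, h, h, attn_mask=mask, need_weights=False)
            x = x + out
        return x

model = SambaStyleBlockStack(d_model=256, n_heads=8, n_pairs=4, window=64)
y = model(torch.randn(2, 128, 256))
```

The division of labor matches the bullet above: the recurrent state provides long-range context at linear cost, while the banded attention mask keeps precise token-to-token retrieval inside a local window.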