Hymba claims that running SSM and attention heads in parallel works better than stacking them; see if a MoE over attention, SSM, and conv branches works even better (rough sketch below)
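A minimal sketch of the idea, assuming PyTorch: the three "experts" are an attention branch, a simple diagonal SSM recurrence, and a causal depthwise conv, mixed per token by a learned router. Module names, the dense softmax gate, and the SSM form are all illustrative assumptions, not anything from Hymba.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixerMoE(nn.Module):
    """Per-token MoE whose experts are attention, an SSM, and a conv (sketch)."""
    def __init__(self, d_model: int, n_heads: int = 4, kernel: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Diagonal SSM branch: per-channel decay applied as a recurrence.
        self.ssm_log_decay = nn.Parameter(torch.zeros(d_model))
        self.ssm_in = nn.Linear(d_model, d_model)
        # Causal depthwise conv branch.
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel - 1, groups=d_model)
        # Router produces per-token weights over the three branches.
        self.router = nn.Linear(d_model, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        B, T, D = x.shape
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)

        # Simple diagonal SSM: h_t = a * h_{t-1} + u_t, scanned sequentially
        # for clarity (a parallel scan would be used in practice).
        a = torch.sigmoid(self.ssm_log_decay)            # (D,)
        u = self.ssm_in(x)
        h = torch.zeros(B, D, device=x.device, dtype=x.dtype)
        ssm_steps = []
        for t in range(T):
            h = a * h + u[:, t]
            ssm_steps.append(h)
        ssm_out = torch.stack(ssm_steps, dim=1)

        # Trim the right padding so the conv stays causal.
        conv_out = self.conv(x.transpose(1, 2))[..., :T].transpose(1, 2)

        # Dense softmax gate over the three branches; a top-k gate would
        # make the mixture sparse.
        gates = F.softmax(self.router(x), dim=-1)        # (B, T, 3)
        branches = torch.stack([attn_out, ssm_out, conv_out], dim=-1)
        return (branches * gates.unsqueeze(2)).sum(dim=-1)

x = torch.randn(2, 16, 64)
print(MixerMoE(64)(x).shape)   # torch.Size([2, 16, 64])
```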
Also potentially see if you can mix different forms of attention (e.g. full and sliding-window) with shared params (sketch below)
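One way this could look, as a sketch assuming PyTorch: full causal attention and sliding-window attention computed from one shared set of Q/K/V/output projections, then mixed by a learned per-token gate. The window size and sigmoid gating are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedParamDualAttention(nn.Module):
    """Full and sliding-window attention sharing one set of projections (sketch)."""
    def __init__(self, d_model: int, n_heads: int = 4, window: int = 8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # shared by both attention forms
        self.out = nn.Linear(d_model, d_model)       # shared output projection
        self.gate = nn.Linear(d_model, 1)
        self.window = window

    def _attend(self, q, k, v, mask):
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        scores = scores.masked_fill(mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))

        idx = torch.arange(T, device=x.device)
        causal = idx[None, :] > idx[:, None]                          # mask future
        local = causal | (idx[:, None] - idx[None, :] >= self.window)  # and far past

        full_out = self._attend(q, k, v, causal)
        local_out = self._attend(q, k, v, local)

        def merge(o):   # (B, H, T, hd) -> (B, T, D)
            return o.transpose(1, 2).reshape(B, T, D)

        g = torch.sigmoid(self.gate(x))               # per-token mixing weight
        return self.out(g * merge(full_out) + (1 - g) * merge(local_out))

x = torch.randn(2, 16, 64)
print(SharedParamDualAttention(64)(x).shape)   # torch.Size([2, 16, 64])
```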
Also see if mixing a standard transformer with Mixture-of-Depths routing works (sketch below)
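A minimal sketch, assuming PyTorch, of Mixture-of-Depths-style routing wrapped around a standard transformer block: a per-token router scores tokens, only the top-k tokens per sequence go through the block, and the rest ride the residual path. The capacity fraction, block internals, and sigmoid re-weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Mixture-of-Depths routing around one transformer block (sketch)."""
    def __init__(self, d_model: int, n_heads: int = 4, capacity: float = 0.5):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity   # fraction of tokens routed through the block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)                    # (B, T)
        # Take the top-k tokens per sequence, sorted to preserve token order.
        topk = scores.topk(k, dim=-1).indices.sort(dim=-1).values

        # Gather only the routed tokens and run the block on the short sequence.
        idx = topk.unsqueeze(-1).expand(-1, -1, D)             # (B, k, D)
        routed = torch.gather(x, 1, idx)
        processed = self.block(routed)

        # Weight the block's contribution by the router score (sigmoid) so the
        # routing decision stays differentiable, then scatter the routed tokens
        # back; unrouted tokens pass through unchanged on the residual path.
        gate = torch.sigmoid(torch.gather(scores, 1, topk)).unsqueeze(-1)
        return x.scatter(1, idx, routed + gate * (processed - routed))

x = torch.randn(2, 16, 64)
print(MoDBlock(64)(x).shape)   # torch.Size([2, 16, 64])
```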