Hymba claims parallel SSM and attention heads work better than stacking them; see if an MoE over attention, SSM, and conv experts works better (rough sketch below)
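
A minimal PyTorch sketch of the MoE idea, assuming per-token soft routing over three expert types (causal attention, a toy diagonal SSM, a causal depthwise conv). All module names, sizes, and the soft-routing choice are illustrative assumptions, not taken from Hymba.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConvExpert(nn.Module):
    """Depthwise causal 1D conv expert."""
    def __init__(self, d_model, kernel_size=4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)

    def forward(self, x):                          # x: (B, T, D)
        y = self.conv(x.transpose(1, 2))           # (B, D, T + k - 1)
        return y[..., :x.size(1)].transpose(1, 2)  # keep first T steps => causal


class DiagonalSSMExpert(nn.Module):
    """Toy input-driven diagonal recurrence h_t = a * h_{t-1} + b * x_t."""
    def __init__(self, d_model):
        super().__init__()
        self.a_logit = nn.Parameter(torch.full((d_model,), -1.0))
        self.b = nn.Parameter(torch.ones(d_model))

    def forward(self, x):                          # x: (B, T, D)
        a = torch.sigmoid(self.a_logit)            # decay in (0, 1)
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):                 # sequential scan, for clarity only
            h = a * h + self.b * x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)


class AttentionExpert(nn.Module):
    """Standard causal multi-head self-attention expert."""
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                     device=x.device), diagonal=1)
        y, _ = self.attn(x, x, x, attn_mask=mask)
        return y


class MixerMoE(nn.Module):
    """Per-token soft routing over {attention, SSM, conv} experts."""
    def __init__(self, d_model):
        super().__init__()
        self.experts = nn.ModuleList([
            AttentionExpert(d_model),
            DiagonalSSMExpert(d_model),
            CausalConvExpert(d_model),
        ])
        self.router = nn.Linear(d_model, len(self.experts))

    def forward(self, x):                                     # x: (B, T, D)
        gates = F.softmax(self.router(x), dim=-1)             # (B, T, E)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)
        return (expert_outs * gates.unsqueeze(-2)).sum(-1)    # weighted mix


x = torch.randn(2, 16, 64)
print(MixerMoE(64)(x).shape)                                  # torch.Size([2, 16, 64])
```

A top-k (hard) router with a load-balancing loss would be the natural next step if the soft mix is too expensive, since the soft version runs every expert on every token.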

Also potentially see if you can mix different forms of attention with shared params (sketch below)
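
A sketch of one way to share params across attention forms, assuming a full causal head and a sliding-window head that reuse the same Q/K/V projections and are combined by a learned per-token gate. The dual-form choice, window size, and gating scheme are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedParamDualAttention(nn.Module):
    """Full causal and sliding-window attention sharing one Q/K/V projection."""
    def __init__(self, d_model, n_heads=4, window=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # shared across both forms
        self.out = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, 1)
        self.window = window

    def _attend(self, q, k, v, mask):
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        scores = scores.masked_fill(mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

    def forward(self, x):                            # x: (B, T, D)
        B, T, _ = x.shape
        qkv = self.qkv(x).chunk(3, dim=-1)           # one projection, reused twice
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in qkv)

        idx = torch.arange(T, device=x.device)
        causal = idx[None, :] > idx[:, None]                       # mask future
        local = causal | (idx[:, None] - idx[None, :] >= self.window)

        y_full = self._attend(q, k, v, causal)       # global causal form
        y_local = self._attend(q, k, v, local)       # sliding-window form

        g = torch.sigmoid(self.gate(x)).unsqueeze(1)               # (B, 1, T, 1)
        y = g * y_full + (1 - g) * y_local
        return self.out(y.transpose(1, 2).reshape(B, T, -1))


x = torch.randn(2, 32, 64)
print(SharedParamDualAttention(64)(x).shape)         # torch.Size([2, 32, 64])
```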

Also see if you can mix a standard transformer with Mixture-of-Depths-style token routing (sketch below)
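
A sketch of a Mixture-of-Depths-style layer, assuming a per-sequence top-k router that sends only the selected tokens through a transformer block while the rest ride the residual path unchanged. The capacity factor and the score-weighted residual update are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn as nn


class MoDTransformerLayer(nn.Module):
    """Only the top-k routed tokens pass through the transformer block."""
    def __init__(self, d_model, n_heads=4, capacity=0.5):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads,
                                                batch_first=True)
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity

    def forward(self, x):                          # x: (B, T, D)
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)        # (B, T)
        topk = scores.topk(k, dim=-1).indices      # (B, k)

        # Gather the selected tokens, run them through the block, scale the
        # block's contribution by the router score so the router gets a
        # gradient, then scatter the result back into the residual stream.
        idx = topk.unsqueeze(-1).expand(-1, -1, D)
        selected = x.gather(1, idx)                # (B, k, D)
        processed = self.block(selected)
        weight = torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        return x.scatter(1, idx, selected + weight * (processed - selected))


x = torch.randn(2, 16, 64)
print(MoDTransformerLayer(64)(x).shape)            # torch.Size([2, 16, 64])
```

Alternating these with plain transformer layers would be the obvious way to "mix" the two, keeping full-depth layers where every token gets compute and MoD layers where only a fraction does.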