Hymba claims that running SSM and attention heads in parallel works better than stacking them; see if a MoE over attention, SSM, and conv branches works even better (rough sketch below)
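A minimal sketch of the idea, assuming PyTorch: the three "experts" are an attention branch, a simple diagonal SSM recurrence, and a causal depthwise conv, mixed per token by a learned router. Module names, the dense softmax gate, and the SSM form are all illustrative assumptions, not anything from Hymba.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixerMoE(nn.Module):
    """Per-token MoE whose experts are attention, an SSM, and a conv (sketch)."""
    def __init__(self, d_model: int, n_heads: int = 4, kernel: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Diagonal SSM branch: per-channel decay applied as a recurrence.
        self.ssm_log_decay = nn.Parameter(torch.zeros(d_model))
        self.ssm_in = nn.Linear(d_model, d_model)
        # Causal depthwise conv branch.
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=kernel - 1, groups=d_model)
        # Router produces per-token weights over the three branches.
        self.router = nn.Linear(d_model, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        B, T, D = x.shape
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)

        # Simple diagonal SSM: h_t = a * h_{t-1} + u_t, scanned sequentially
        # for clarity (a parallel scan would be used in practice).
        a = torch.sigmoid(self.ssm_log_decay)            # (D,)
        u = self.ssm_in(x)
        h = torch.zeros(B, D, device=x.device, dtype=x.dtype)
        ssm_steps = []
        for t in range(T):
            h = a * h + u[:, t]
            ssm_steps.append(h)
        ssm_out = torch.stack(ssm_steps, dim=1)

        # Trim the right padding so the conv stays causal.
        conv_out = self.conv(x.transpose(1, 2))[..., :T].transpose(1, 2)

        # Dense softmax gate over the three branches; a top-k gate would
        # make the mixture sparse.
        gates = F.softmax(self.router(x), dim=-1)        # (B, T, 3)
        branches = torch.stack([attn_out, ssm_out, conv_out], dim=-1)
        return (branches * gates.unsqueeze(2)).sum(dim=-1)

x = torch.randn(2, 16, 64)
print(MixerMoE(64)(x).shape)   # torch.Size([2, 16, 64])
```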
Also potentially see if you can mix different forms of attention (e.g. full and sliding-window) with shared params (sketch below)
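One way this could look, as a sketch assuming PyTorch: full causal attention and sliding-window attention computed from one shared set of Q/K/V/output projections, then mixed by a learned per-token gate. The window size and sigmoid gating are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedParamDualAttention(nn.Module):
    """Full and sliding-window attention sharing one set of projections (sketch)."""
    def __init__(self, d_model: int, n_heads: int = 4, window: int = 8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # shared by both attention forms
        self.out = nn.Linear(d_model, d_model)       # shared output projection
        self.gate = nn.Linear(d_model, 1)
        self.window = window

    def _attend(self, q, k, v, mask):
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        scores = scores.masked_fill(mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))

        idx = torch.arange(T, device=x.device)
        causal = idx[None, :] > idx[:, None]                          # mask future
        local = causal | (idx[:, None] - idx[None, :] >= self.window)  # and far past

        full_out = self._attend(q, k, v, causal)
        local_out = self._attend(q, k, v, local)

        def merge(o):   # (B, H, T, hd) -> (B, T, D)
            return o.transpose(1, 2).reshape(B, T, D)

        g = torch.sigmoid(self.gate(x))               # per-token mixing weight
        return self.out(g * merge(full_out) + (1 - g) * merge(local_out))

x = torch.randn(2, 16, 64)
print(SharedParamDualAttention(64)(x).shape)   # torch.Size([2, 16, 64])
```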
Also see if mixing a standard transformer with Mixture-of-Depths routing works (sketch below)
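A minimal sketch, assuming PyTorch, of Mixture-of-Depths-style routing wrapped around a standard transformer block: a per-token router scores tokens, only the top-k tokens per sequence go through the block, and the rest ride the residual path. The capacity fraction, block internals, and sigmoid re-weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Mixture-of-Depths routing around one transformer block (sketch)."""
    def __init__(self, d_model: int, n_heads: int = 4, capacity: float = 0.5):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity   # fraction of tokens routed through the block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)                    # (B, T)
        # Take the top-k tokens per sequence, sorted to preserve token order.
        topk = scores.topk(k, dim=-1).indices.sort(dim=-1).values

        # Gather only the routed tokens and run the block on the short sequence.
        idx = topk.unsqueeze(-1).expand(-1, -1, D)             # (B, k, D)
        routed = torch.gather(x, 1, idx)
        processed = self.block(routed)

        # Weight the block's contribution by the router score (sigmoid) so the
        # routing decision stays differentiable, then scatter the routed tokens
        # back; unrouted tokens pass through unchanged on the residual path.
        gate = torch.sigmoid(torch.gather(scores, 1, topk)).unsqueeze(-1)
        return x.scatter(1, idx, routed + gate * (processed - routed))

x = torch.randn(2, 16, 64)
print(MoDBlock(64)(x).shape)   # torch.Size([2, 16, 64])
```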