All-to-all pretraining
MoE with default active experts for each modality
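A rough sketch of what "default active experts per modality" could look like, assuming a PyTorch-style MoE; `MODALITY_DEFAULTS`, the expert counts, and the dense dispatch loop are all placeholders, not a fixed design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical mapping from modality to the experts that are always active for it.
MODALITY_DEFAULTS = {"chars": [0, 1], "pixels": [2, 3]}

class ModalityMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, extra_k: int = 1):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(dim, num_experts)
        self.extra_k = extra_k

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # x: (batch, seq, dim)
        defaults = MODALITY_DEFAULTS[modality]
        logits = self.router(x)                                  # (B, S, E)
        # Mask the defaults so the router only picks *additional* experts.
        mask = torch.zeros(logits.size(-1), dtype=torch.bool, device=x.device)
        mask[defaults] = True
        extra_logits = logits.masked_fill(mask, float("-inf"))
        extra = extra_logits.topk(self.extra_k, dim=-1).indices  # (B, S, k)
        weights = F.softmax(extra_logits, dim=-1)                # zero weight on defaults

        out = torch.zeros_like(x)
        for e in defaults:                                       # always-on experts for this modality
            out = out + self.experts[e](x)
        for e in range(len(self.experts)):                       # dense loop for clarity; a real MoE dispatches sparsely
            sel = (extra == e).any(dim=-1, keepdim=True)         # (B, S, 1)
            out = out + sel.float() * weights[..., e : e + 1] * self.experts[e](x)
        return out
```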
block router with a choice of conv, attention, and other block types
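One way the block router could be wired, as a sketch: a gate picks (softly here, for differentiability) between a conv block and an attention block per layer; the block menu and the gating scheme are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x):                              # x: (B, S, D)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class AttnBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class BlockRouter(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.blocks = nn.ModuleList([ConvBlock(dim), AttnBlock(dim)])  # could also hold MLP, SSM, ... blocks
        self.gate = nn.Linear(dim, len(self.blocks))

    def forward(self, x):
        # Gate on mean-pooled features; a hard (top-1) choice would need a straight-through trick.
        g = F.softmax(self.gate(x.mean(dim=1)), dim=-1)              # (B, num_blocks)
        outs = torch.stack([blk(x) for blk in self.blocks], dim=1)   # (B, num_blocks, S, D)
        return (g[:, :, None, None] * outs).sum(dim=1)
```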
large register bank
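Reading "large register bank" as a pool of learned register tokens prepended to every sequence (scratch space for attention); the size and placement here are guesses:

```python
import torch
import torch.nn as nn

class RegisterBank(nn.Module):
    def __init__(self, dim: int, num_registers: int = 256):
        super().__init__()
        # Learned register tokens shared across all inputs.
        self.registers = nn.Parameter(torch.randn(num_registers, dim) * 0.02)

    def prepend(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, S, D) -> (B, num_registers + S, D)
        regs = self.registers.unsqueeze(0).expand(x.size(0), -1, -1)
        return torch.cat([regs, x], dim=1)

    def strip(self, x: torch.Tensor) -> torch.Tensor:
        # Drop the register positions after the backbone has run.
        return x[:, self.registers.size(0):]
```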
conv encoders on characters and pixels (separate encoders)
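Sketch of the two separate conv front-ends, one over character/byte ids and one over pixels, both mapping into a shared model dim; every hyperparameter is a placeholder:

```python
import torch
import torch.nn as nn

class CharConvEncoder(nn.Module):
    def __init__(self, vocab_size: int = 256, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.convs = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2), nn.GELU(),
        )

    def forward(self, ids):                          # ids: (B, S) character/byte ids
        x = self.embed(ids).transpose(1, 2)          # (B, D, S)
        return self.convs(x).transpose(1, 2)         # (B, S/4, D)

class PixelConvEncoder(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, dim // 4, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(dim // 4, dim, kernel_size=4, stride=4), nn.GELU(),
        )

    def forward(self, img):                          # img: (B, 3, H, W)
        x = self.convs(img)                          # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)          # (B, H*W/256, D)
```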
distill from several SOTA teacher models
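For the distillation piece, a minimal multi-teacher sketch; which teachers, and whether to match logits or features, is left open in the note, so this only shows averaged soft-target (logit) matching:

```python
import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(student_logits, teacher_logits_list, temperature=2.0):
    # student_logits: (B, C); teacher_logits_list: list of (B, C) outputs from frozen teachers.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    loss = 0.0
    for t_logits in teacher_logits_list:
        p_teacher = F.softmax(t_logits.detach() / temperature, dim=-1)
        loss = loss + F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    # Standard T^2 scaling keeps gradients comparable across temperatures.
    return (temperature ** 2) * loss / len(teacher_logits_list)
```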
- Pros
  - Easier to train in a distributed setting (no model-size imbalance)