All-to-all pretraining

MoE with default active experts for each modality
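
A minimal sketch of what this could mean, assuming top-1 learned routing plus one always-active default expert per modality (PyTorch; all names, shapes, and the gating scheme are placeholders, not a fixed design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDefaultMoE(nn.Module):
    """Each token goes through its modality's default expert plus one
    router-chosen expert; `default_expert` maps modality name -> expert index."""
    def __init__(self, dim, num_experts, default_expert):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(dim, num_experts)
        self.default_expert = default_expert  # e.g. {"text": 0, "image": 1}

    def forward(self, x, modality):
        # x: (batch, seq, dim)
        logits = self.router(x)               # (batch, seq, num_experts)
        gate = F.softmax(logits, dim=-1)
        top1 = logits.argmax(dim=-1)          # learned pick per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if e == self.default_expert[modality]:
                mask = torch.ones_like(mask)  # default expert is always active
            if mask.any():
                out[mask] += gate[..., e][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Whether the default expert is gated by the softmax (as here) or added with weight 1 is an open design choice.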

block router with a choice of conv, attention, etc.
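
One way to read this, as a sketch: each block holds a small set of candidate operators and a learned gate that mixes (or, in a hard top-1 variant, selects) among them. The operator set and the soft mixing below are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedBlock(nn.Module):
    """A block whose operator is chosen by a learned gate over
    {conv, attention, MLP}; soft mixing here, hard top-1 is a variant."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.gate = nn.Parameter(torch.zeros(3))  # one logit per candidate op
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, seq, dim)
        h = self.norm(x)
        w = F.softmax(self.gate, dim=0)
        conv_out = self.conv(h.transpose(1, 2)).transpose(1, 2)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        mlp_out = self.mlp(h)
        return x + w[0] * conv_out + w[1] * attn_out + w[2] * mlp_out
```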

large register bank
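
Presumably a large set of learned register tokens prepended to every sequence (in the spirit of ViT register tokens); the sizes below are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class RegisterBank(nn.Module):
    """Learned register tokens prepended to the input sequence."""
    def __init__(self, num_registers=1024, dim=768):
        super().__init__()
        self.registers = nn.Parameter(torch.randn(num_registers, dim) * 0.02)

    def forward(self, x):
        # x: (batch, seq, dim) -> (batch, num_registers + seq, dim)
        regs = self.registers.unsqueeze(0).expand(x.shape[0], -1, -1)
        return torch.cat([regs, x], dim=1)
```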

conv encoders on characters and pixels (separate encoders)
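
Sketch of the two separate encoders, assuming a 1D conv stack over character embeddings and a 2D conv patchifier over raw pixels, both landing in the same token width (all sizes are placeholders):

```python
import torch
import torch.nn as nn

class CharConvEncoder(nn.Module):
    """1D conv stack over character embeddings; downsamples roughly 4x."""
    def __init__(self, vocab_size=512, dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2), nn.GELU(),
        )

    def forward(self, char_ids):
        # char_ids: (batch, chars) -> (batch, ~chars // 4, dim)
        return self.conv(self.embed(char_ids).transpose(1, 2)).transpose(1, 2)

class PixelConvEncoder(nn.Module):
    """2D conv stack that turns raw pixels into 16x16-patch tokens."""
    def __init__(self, dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim // 4, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(dim // 4, dim, kernel_size=4, stride=4),
        )

    def forward(self, pixels):
        # pixels: (batch, 3, H, W) -> (batch, (H // 16) * (W // 16), dim)
        return self.conv(pixels).flatten(2).transpose(1, 2)
```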

distill from a bunch of SOTA models (loss sketch below the pros list)

  • Pros
    • Easier to train in a distributed setting (no model-size imbalance)
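
For the multi-teacher distillation note above, a minimal loss sketch (equal teacher weighting, logit-level KL, and the temperature are all assumptions):

```python
import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Average temperature-scaled KL from the student to each frozen teacher."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    loss = student_logits.new_zeros(())
    for teacher_logits in teacher_logits_list:
        p_teacher = F.softmax(teacher_logits.detach() / t, dim=-1)
        loss = loss + F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
    return loss / len(teacher_logits_list)
```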