Early-Fusion Multimodal Encoder Models

All-to-all pretraining
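
A minimal sketch of one way to read this, assuming PyTorch: paired examples are embedded per modality by the shared encoder, and an InfoNCE-style term is summed over every ordered modality pair. The function name, the contrastive choice of objective, and the temperature are all assumptions, not a fixed design.

```python
import itertools
import torch
import torch.nn.functional as F

def all_to_all_contrastive_loss(pooled: dict[str, torch.Tensor], temperature: float = 0.07):
    """Sum an InfoNCE term over every ordered pair of modalities.

    `pooled` maps modality name -> (batch, dim) pooled embeddings coming out of
    the shared early-fusion encoder (hypothetical upstream code).
    """
    losses = []
    for a, b in itertools.permutations(pooled.keys(), 2):
        za = F.normalize(pooled[a], dim=-1)
        zb = F.normalize(pooled[b], dim=-1)
        logits = za @ zb.t() / temperature               # (batch, batch) similarities
        targets = torch.arange(za.size(0), device=za.device)
        losses.append(F.cross_entropy(logits, targets))  # matched pairs sit on the diagonal
    return torch.stack(losses).mean()

# usage sketch: pooled embeddings of the same batch of paired examples
# loss = all_to_all_contrastive_loss({"text": t, "image": i, "audio": a})
```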

MoE with default-active experts for each modality
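
One reading of "default-active experts" is that each modality owns an expert that always fires for its own tokens, with top-k shared experts routed on top. A dense (non-dispatched) PyTorch sketch under that assumption; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDefaultMoE(nn.Module):
    """MoE block where every modality owns one always-active default expert,
    and a router adds top-k shared experts on top (dense, illustrative sketch)."""

    def __init__(self, dim: int, n_modalities: int, n_shared: int, top_k: int = 1):
        super().__init__()
        self.top_k = top_k
        make_ffn = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.default_experts = nn.ModuleList([make_ffn() for _ in range(n_modalities)])
        self.shared_experts = nn.ModuleList([make_ffn() for _ in range(n_shared)])
        self.router = nn.Linear(dim, n_shared)

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_ids: (batch, seq) integer modality id per token
        out = torch.zeros_like(x)
        for m, expert in enumerate(self.default_experts):
            mask = (modality_ids == m).unsqueeze(-1)       # default expert always fires
            out = out + mask * expert(x)                   # for its own modality's tokens
        weights = F.softmax(self.router(x), dim=-1)        # (batch, seq, n_shared)
        topw, topi = weights.topk(self.top_k, dim=-1)
        for e, expert in enumerate(self.shared_experts):
            for k in range(self.top_k):
                mask = (topi[..., k] == e).unsqueeze(-1)   # tokens routed to shared expert e
                out = out + mask * topw[..., k:k + 1] * expert(x)
        return out
```

Every expert is evaluated on every token here for clarity; an actual MoE layer would dispatch tokens to experts instead.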

Block-level router with a choice of operators (conv, attention, etc.)
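
A sketch of a block-level operator router, assuming PyTorch and soft per-sequence routing over three candidate operators (depthwise conv, self-attention, MLP); a hard top-1 choice would be a small variation. Everything here is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedBlock(nn.Module):
    """One encoder block that mixes candidate operators (conv / attention / MLP)
    with weights produced by a learned router (soft routing sketch)."""

    def __init__(self, dim: int, n_heads: int = 8, kernel_size: int = 7):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.router = nn.Linear(dim, 3)   # one logit per candidate operator

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        gates = F.softmax(self.router(h.mean(dim=1)), dim=-1)    # (batch, 3), per-sequence routing
        conv_out = self.conv(h.transpose(1, 2)).transpose(1, 2)  # (batch, seq, dim)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        mlp_out = self.mlp(h)
        mixed = (gates[:, 0, None, None] * conv_out
                 + gates[:, 1, None, None] * attn_out
                 + gates[:, 2, None, None] * mlp_out)
        return x + mixed                                         # residual connection
```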

Large register bank
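
Registers in the sense of learned, input-independent tokens prepended to the sequence and discarded at the output; "large" just means many of them. A minimal PyTorch sketch with a placeholder count.

```python
import torch
import torch.nn as nn

class RegisterBank(nn.Module):
    """Prepend a large bank of learned register tokens to the fused sequence,
    then strip them before producing outputs (sketch; sizes are illustrative)."""

    def __init__(self, dim: int, n_registers: int = 1024):
        super().__init__()
        self.registers = nn.Parameter(torch.randn(n_registers, dim) * 0.02)

    def prepend(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) -> (batch, n_registers + seq, dim)
        regs = self.registers.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return torch.cat([regs, tokens], dim=1)

    def strip(self, tokens: torch.Tensor) -> torch.Tensor:
        # drop the register positions after the encoder has run
        return tokens[:, self.registers.size(0):]

# usage sketch (encoder is whatever early-fusion stack sits in the middle):
# x = bank.prepend(fused_tokens); x = encoder(x); outputs = bank.strip(x)
```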

Conv encoders over characters and pixels (a separate encoder per modality)
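
A sketch of the two separate conv front-ends, assuming PyTorch: a 1D conv stack over character/byte embeddings and a 2D conv stem over pixels, both mapping into the same token width so their outputs can be concatenated into one fused sequence. Vocabulary size, strides, and widths are placeholders.

```python
import torch
import torch.nn as nn

class CharConvEncoder(nn.Module):
    """1D conv encoder over raw characters/bytes -> (batch, seq', dim) tokens."""
    def __init__(self, vocab_size: int = 256, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.convs = nn.Sequential(                      # each stride-2 conv halves the length
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2), nn.GELU(),
        )

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(char_ids).transpose(1, 2)         # (batch, dim, chars)
        return self.convs(x).transpose(1, 2)             # (batch, ~chars // 4, dim)

class PixelConvEncoder(nn.Module):
    """2D conv stem over pixels -> (batch, n_patches, dim) tokens."""
    def __init__(self, dim: int = 512, patch: int = 16):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.stem(images)                            # (batch, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)              # (batch, n_patches, dim)

# fused early-fusion input: one sequence holding both modalities' tokens
# tokens = torch.cat([CharConvEncoder()(chars), PixelConvEncoder()(imgs)], dim=1)
```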

Distill from an ensemble of SOTA teacher models
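
A sketch of multi-teacher feature distillation, assuming PyTorch, precomputed teacher embeddings, and a per-teacher projection head; the cosine loss and the teacher names in the usage comment are placeholders, not a claim about which models to distill from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistillHead(nn.Module):
    """Project student features into each teacher's embedding space and
    regress them against frozen teacher features (illustrative sketch)."""

    def __init__(self, student_dim: int, teacher_dims: dict[str, int]):
        super().__init__()
        self.proj = nn.ModuleDict({name: nn.Linear(student_dim, d)
                                   for name, d in teacher_dims.items()})

    def forward(self, student_feats: torch.Tensor,
                teacher_feats: dict[str, torch.Tensor]) -> torch.Tensor:
        # student_feats: (batch, student_dim) pooled student embedding
        # teacher_feats[name]: (batch, teacher_dims[name]) precomputed, no grad needed
        losses = []
        for name, target in teacher_feats.items():
            pred = self.proj[name](student_feats)
            losses.append(1 - F.cosine_similarity(pred, target.detach(), dim=-1).mean())
        return torch.stack(losses).mean()

# usage sketch; the teacher names and dims are hypothetical placeholders
# head = MultiTeacherDistillHead(512, {"teacher_a": 768, "teacher_b": 1024, "teacher_c": 1280})
```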