Latent Transformers with small vocabularies
Use a small vocabulary (e.g. 1024 tokens, or even character level) together with a 1D causal convolutional VAE to shrink the embedding table and compress the sequence length before the transformer; rough sketch below
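A minimal sketch of the idea, assuming PyTorch; the sizes (vocab 1024, embed/latent dims, 4x downsampling via two stride-2 causal convs) are illustrative placeholders, not settled choices. The transformer would then model the shorter latent sequence `z` rather than the raw tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D conv that only sees current and past positions (left padding only)."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride)

    def forward(self, x):                        # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))         # pad on the left to stay causal
        return self.conv(x)

class CausalConvVAE(nn.Module):
    """Compress a small-vocab token sequence into a 4x shorter latent sequence."""
    def __init__(self, vocab_size=1024, embed_dim=64, latent_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # small embedding table
        self.enc1 = CausalConv1d(embed_dim, 128, kernel_size=4, stride=2)
        self.enc2 = CausalConv1d(128, 2 * latent_dim, kernel_size=4, stride=2)
        # Decoder upsamples back to the original length for reconstruction.
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 128, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.ConvTranspose1d(128, vocab_size, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, tokens):                   # tokens: (batch, time)
        h = self.embed(tokens).transpose(1, 2)   # -> (batch, embed_dim, time)
        h = F.gelu(self.enc1(h))
        mu, logvar = self.enc2(h).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        logits = self.dec(z)                     # (batch, vocab_size, time)
        recon = F.cross_entropy(logits, tokens)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl, z                     # ELBO-style loss and latent sequence
```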
Could the small token set be aligned to an existing large tokenizer's vocabulary with CTC, letting the loss marginalize over alignments instead of requiring a fixed segmentation? (cf. "Sequence Modeling with CTC")
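A sketch of one possible direction, assuming PyTorch and its built-in `nn.CTCLoss`: per-character (or per-latent-step) hidden states are projected to logits over a large tokenizer's vocabulary plus a blank symbol, and CTC handles the length mismatch, since a character sequence is always at least as long as its tokenization under a larger-vocab tokenizer. The class name, vocab size, and hidden size are placeholders.

```python
import torch
import torch.nn as nn

class CharToLargeVocabHead(nn.Module):
    """Map per-character hidden states to logits over a large vocab + CTC blank."""
    def __init__(self, hidden_dim=512, large_vocab_size=50257):  # placeholder sizes
        super().__init__()
        self.proj = nn.Linear(hidden_dim, large_vocab_size + 1)  # index 0 = blank
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, char_hidden, targets, char_lengths, target_lengths):
        # char_hidden: (batch, char_len, hidden_dim) from the character-level model
        # targets:     (batch, target_len) ids from the existing large tokenizer,
        #              shifted by +1 so that id 0 stays reserved for the blank
        log_probs = self.proj(char_hidden).log_softmax(dim=-1)   # (B, T, V+1)
        log_probs = log_probs.transpose(0, 1)                    # CTC expects (T, B, V+1)
        return self.ctc(log_probs, targets, char_lengths, target_lengths)

# Usage with dummy data: 40 characters aligned to 10 large-tokenizer tokens.
head = CharToLargeVocabHead()
hidden = torch.randn(2, 40, 512)
targets = torch.randint(1, 50258, (2, 10))       # ids already shifted past the blank
loss = head(hidden, targets, torch.tensor([40, 40]), torch.tensor([10, 10]))
```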