Latent Transformers with small vocabularies

Use a small vocab (or even character level, e.g. 1024 tokens) with a 1D causal conv VAE to shrink the embedding table and compress sequence length; rough sketch below.
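
A minimal sketch of what that could look like, assuming PyTorch. The vocab size of 1024, the 4x stride-2 downsampling, and all module/parameter names are illustrative assumptions, not a fixed design; a latent transformer would then run over the shorter latent sequence `z`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Conv1d):
    """1D conv that only sees past positions (left padding only)."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__(in_ch, out_ch, kernel_size, stride=stride)
        self.left_pad = kernel_size - 1

    def forward(self, x):
        # x: (batch, channels, time); pad on the left so the conv stays causal
        return super().forward(F.pad(x, (self.left_pad, 0)))


class TinyCausalConvVAE(nn.Module):
    """Small-vocab tokens -> 4x shorter latent sequence -> reconstructed tokens."""

    def __init__(self, vocab_size=1024, dim=256, latent_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # small table: 1024 x dim
        # Two stride-2 causal convs give a 4x shorter latent sequence
        self.enc = nn.Sequential(
            CausalConv1d(dim, dim, kernel_size=4, stride=2), nn.GELU(),
            CausalConv1d(dim, dim, kernel_size=4, stride=2), nn.GELU(),
        )
        self.to_mu = nn.Conv1d(dim, latent_dim, 1)
        self.to_logvar = nn.Conv1d(dim, latent_dim, 1)
        # Decoder upsamples back to the original token rate
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, dim, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose1d(dim, dim, 4, stride=2, padding=1), nn.GELU(),
        )
        self.to_logits = nn.Conv1d(dim, vocab_size, 1)

    def forward(self, tokens):
        # tokens: (batch, time) ints in [0, vocab_size)
        x = self.embed(tokens).transpose(1, 2)                 # (B, dim, T)
        h = self.enc(x)                                        # (B, dim, T/4)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        logits = self.to_logits(self.dec(z))                   # (B, vocab, T)
        return logits, mu, logvar


if __name__ == "__main__":
    model = TinyCausalConvVAE()
    toks = torch.randint(0, 1024, (2, 64))
    logits, mu, logvar = model(toks)
    print(logits.shape, mu.shape)  # latent sequence is 4x shorter than the input
```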

Align from the small token set to existing large tokenizers with CTC? (cf. "Sequence Modeling with CTC"); sketch below.
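
One way this could work, sketched with PyTorch's `nn.CTCLoss` under assumed sizes: the model reading the small-vocab/character stream emits, at every position, logits over the large tokenizer's vocabulary plus a blank symbol, and CTC marginalizes over all monotonic alignments to the shorter BPE target sequence, so no explicit alignment is needed. All sizes and names here are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: long char-level stream vs. a large existing BPE vocab
char_seq_len, batch = 128, 4
big_vocab = 50_000            # e.g. an existing large tokenizer
blank_id = big_vocab          # CTC blank appended after the real vocab

# Per-character-position logits over (big vocab + blank); random stand-in for
# the head of the model that reads the small-vocab stream.
logits = torch.randn(char_seq_len, batch, big_vocab + 1)
log_probs = logits.log_softmax(dim=-1)

# Targets: the same text tokenized by the large tokenizer (shorter sequences)
target_lens = torch.tensor([37, 41, 29, 33])
targets = torch.cat([torch.randint(0, big_vocab, (int(n),)) for n in target_lens])
input_lens = torch.full((batch,), char_seq_len, dtype=torch.long)

ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
loss = ctc(log_probs, targets, input_lens, target_lens)
print(loss)  # trains char-rate outputs to line up with BPE tokens
```

The catch with this framing is that the CTC output head is over the large vocabulary, so it only removes the big embedding/output table from the latent transformer itself, not from the alignment head.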