Speech Recognition

ASR

Transducers - RNN-T

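An RNN-T has three parts. A separate source encoder processes the input (audio) sequence. A “predictor” model sees only the previously emitted output tokens and predicts the next token in output space, much like a language model. A “joiner” module takes the encoder and predictor outputs and combines them to predict the next output, which is either a token or a blank.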
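The joiner step can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes an additive joiner with a tanh nonlinearity (a common choice, but not the only one), and the names `joiner`, `w_enc`, `w_pred`, `w_out` and all the toy numbers are hypothetical.

```python
import math

def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def log_softmax(z):
    # Numerically stable log-softmax over a list of logits.
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def joiner(enc_t, pred_u, w_enc, w_pred, w_out):
    """Combine one encoder frame and one predictor state into
    log-probabilities over the output vocabulary (index 0 = blank here)."""
    # Project both streams into a shared hidden space, add, squash (assumed design).
    h = [math.tanh(a + b)
         for a, b in zip(matvec(w_enc, enc_t), matvec(w_pred, pred_u))]
    return log_softmax(matvec(w_out, h))

# Toy sizes: encoder dim 2, predictor dim 1, hidden dim 2, vocabulary of 3.
enc_t = [0.5, -1.0]            # encoder output at audio frame t
pred_u = [1.0]                 # predictor output after u emitted tokens
w_enc = [[1.0, 0.0], [0.0, 1.0]]
w_pred = [[0.5], [0.5]]
w_out = [[0.2, -0.1], [1.0, 0.3], [-0.4, 0.8]]

log_probs = joiner(enc_t, pred_u, w_enc, w_pred, w_out)
print(log_probs)
```

In a full model the joiner is evaluated for every pair of audio frame t and output position u, and the resulting grid of distributions feeds the transducer loss.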

HuBERT

CTC

Decoding

Greedy
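Assuming this heading refers to greedy decoding of CTC outputs, the usual recipe is: take the highest-scoring label in each frame, collapse consecutive repeats, then drop blanks. A minimal sketch follows; the function name `ctc_greedy_decode`, the choice of index 0 as blank, and the toy scores are all illustrative assumptions.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """frame_scores: one list of per-label scores per audio frame.
    Returns the decoded label sequence."""
    # Best label index per frame (argmax).
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    out = []
    prev = None
    for k in best:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out

# Per-frame argmax is 1, 1, 0, 1, 2, 2, 0 (0 = blank).
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
    [0.8, 0.1, 0.1],
]
decoded = ctc_greedy_decode(frames)
print(decoded)  # [1, 1, 2]
```

The blank between the two runs of label 1 is what lets the decoder output the same label twice in a row.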

Conformer

Whisper

Mamba

TTS - Text to Speech

F5-TTS

End-to-End Multimodal Speech-to-Speech

Spirit LM