Speech - Speech Recognition and TTS

December 9, 2023, updated October 27, 2024

GitHub - modelscope/ClearerVoice-Studio: An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.
[2412.10117] CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Introduction to audio data - Hugging Face Audio Course

Speech Recognition

Transducers - RNN-T

A transducer pairs a source encoder for the input sequence with a separate “predictor” model that sees only the output space and predicts the next output token (like an LLM). A “joiner” module takes the encoder and predictor outputs and combines them to predict the next output.

+ good for streaming
+ can pretrain the predictor in a self-supervised fashion on next-token prediction
Sequence-to-sequence learning with Transducers - Loren Lugosch
In Depth Explaination Of RNN-T based Automatic Speech Recognition Systems (ASR) - YouTube
- sites.cc.gatech.edu/classes/AY2021/cs7643_spring/assets/L24_rnnt_asr_tutorial_gt.pdf
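
A minimal sketch of the joiner step described above, in plain Python. The additive join followed by `tanh`, the linear output layer, the blank at index 0, and all names (`joiner`, `joint_lattice`, `weights`, `bias`) are assumptions for illustration, not taken from a specific implementation. It shows that every (encoder frame, predictor state) pair gets its own distribution over the output vocabulary.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def joiner(enc_t, pred_u, weights, bias):
    """Combine one encoder frame and one predictor state into a
    distribution over the vocabulary (index 0 is assumed to be blank)."""
    # Assumed join: element-wise sum, then tanh, then a linear layer.
    hidden = [math.tanh(e + p) for e, p in zip(enc_t, pred_u)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

def joint_lattice(enc_out, pred_out, weights, bias):
    """Return a T x U grid: one distribution per (frame, output position)."""
    return [[joiner(e, p, weights, bias) for p in pred_out] for e in enc_out]

# Toy inputs: T=3 encoder frames, U=2 predictor states, hidden size 2, vocab size 3.
enc_out = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
pred_out = [[0.0, 0.1], [0.2, 0.2]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [0.0, 0.0, 0.0]

lattice = joint_lattice(enc_out, pred_out, weights, bias)
print(len(lattice), len(lattice[0]), len(lattice[0][0]))  # 3 2 3
```

The predictor states depend only on previously emitted tokens, which is why the predictor can be trained separately on text, and the encoder can run frame by frame for streaming.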

HuBERT

[2106.07447] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

CTC

Decoding

Greedy
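
A minimal sketch, assuming this heading refers to greedy decoding of CTC outputs: take the most likely symbol in each frame, merge consecutive repeats, then drop blanks. The function name `ctc_greedy_decode`, the blank index 0, and the toy vocabulary are assumptions for illustration.

```python
def ctc_greedy_decode(frame_probs, blank=0):
    """Pick the argmax symbol per frame, merge repeats, drop blanks."""
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    prev = None
    for sym in best_path:
        # A symbol is emitted only when it differs from the previous frame
        # and is not the blank; a blank between two equal symbols keeps both.
        if sym != prev and sym != blank:
            decoded.append(sym)
        prev = sym
    return decoded

# Toy vocabulary: 0 = blank, 1 = "a", 2 = "b".
vocab = ["_", "a", "b"]
frame_probs = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.6, 0.3],    # a (new symbol after the blank)
    [0.1, 0.2, 0.7],    # b
    [0.2, 0.1, 0.7],    # b (repeat, merged)
    [0.8, 0.1, 0.1],    # blank
]

ids = ctc_greedy_decode(frame_probs)
text = "".join(vocab[i] for i in ids)
print(ids, text)  # [1, 1, 2] aab
```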

Beam Search
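
A sketch of CTC prefix beam search without a language model, assuming this heading refers to beam search over CTC outputs. Each candidate prefix keeps two probabilities, one for alignments ending in blank and one for alignments ending in a non-blank symbol, so that all alignments collapsing to the same prefix are summed. The name `ctc_prefix_beam_search` and the blank index 0 are assumptions for illustration.

```python
from collections import defaultdict

def ctc_prefix_beam_search(frame_probs, beam_width=3, blank=0):
    """Return (best_prefix, probability) under CTC, summing over alignments."""
    # prefix (tuple of symbol ids) -> (p_ending_in_blank, p_ending_in_non_blank)
    beams = {(): (1.0, 0.0)}
    for probs in frame_probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for sym, p in enumerate(probs):
                if sym == blank:
                    # Blank keeps the prefix unchanged.
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and sym == prefix[-1]:
                    # Repeat without a blank in between: merged into the same prefix.
                    next_beams[prefix][1] += p_nb * p
                    # Repeat after a blank: a genuinely new symbol.
                    next_beams[prefix + (sym,)][1] += p_b * p
                else:
                    next_beams[prefix + (sym,)][1] += (p_b + p_nb) * p
        # Keep only the beam_width most probable prefixes.
        ranked = sorted(next_beams.items(),
                        key=lambda kv: kv[1][0] + kv[1][1], reverse=True)
        beams = {prefix: (pb, pnb) for prefix, (pb, pnb) in ranked[:beam_width]}
    best_prefix, (pb, pnb) = max(beams.items(), key=lambda kv: sum(kv[1]))
    return list(best_prefix), pb + pnb

# Two frames, vocabulary: 0 = blank, 1 = "a".
frame_probs = [[0.6, 0.4], [0.6, 0.4]]
prefix, score = ctc_prefix_beam_search(frame_probs)
print(prefix, round(score, 2))  # [1] 0.64
```

In this toy input the per-frame argmax is blank in both frames, so a greedy pass yields the empty output (probability 0.36), while the three alignments that collapse to "a" sum to 0.64.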

Conformer

[2005.08100] Conformer: Convolution-augmented Transformer for Speech Recognition

Whisper

Mamba

TTS - Text to Speech

F5-TTS

[2410.06885] F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

CosyVoice 2

[2412.10117] CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

End to End Multimodal Speech to Speech

Spirit LM

Links

asr ml