michal.i/o

speech

Dec 22, 2024 · 2 min read

asr

GitHub - modelscope/ClearerVoice-Studio: An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.

Speech Recognition

Open ASR Leaderboard - a Hugging Face Space by hf-audio
https://github.com/openai/whisper
https://github.com/ggerganov/whisper.cpp
ylacombe/whisper-large-v3-turbo · Hugging Face
nvidia/canary-1b · Hugging Face
Whisper - a mlx-community Collection
GitHub - sindresorhus/awesome-whisper: 🔊 Awesome list for Whisper — an open-source AI-powered speech recognition system developed by OpenAI

Transducers - RNN-T

A transducer has a separate source encoder for the input (audio) sequence and a “predictor” model that only predicts the next token in the output space (like a language model), plus a “Joiner” module that takes the encoder and predictor outputs and combines them to predict the next output token or a blank.

good for streaming
+ can pretrain the predictor in a self-supervised fashion on next-token prediction
Sequence-to-sequence learning with Transducers - Loren Lugosch
In Depth Explaination Of RNN-T based Automatic Speech Recognition Systems (ASR) - YouTube
- sites.cc.gatech.edu/classes/AY2021/cs7643_spring/assets/L24_rnnt_asr_tutorial_gt.pdf
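
To make the encoder / predictor / joiner split concrete, here is a minimal sketch of greedy transducer decoding in plain Python. Everything in it is an assumption for illustration: the additive `tanh` joiner, the `weights` matrix, the `predictor` callable, blank at index 0 and the `max_symbols_per_frame` cap are hypothetical choices, not taken from any of the linked implementations.

```python
import math

BLANK = 0  # assumed convention: index 0 is the blank symbol


def joiner(enc_t, pred_u, weights):
    """Combine one encoder frame and one predictor state into vocabulary logits."""
    # Hypothetical joiner: add the two vectors, apply tanh, then a linear layer.
    hidden = [math.tanh(e + p) for e, p in zip(enc_t, pred_u)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def greedy_transducer_decode(enc_frames, predictor, weights, max_symbols_per_frame=5):
    """Greedy decoding: emit tokens on a frame until blank wins, then advance."""
    tokens = []
    pred_u = predictor(tokens)  # predictor only sees previously emitted tokens
    for enc_t in enc_frames:
        for _ in range(max_symbols_per_frame):
            logits = joiner(enc_t, pred_u, weights)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == BLANK:
                break  # blank means "nothing more here", move to the next frame
            tokens.append(best)
            pred_u = predictor(tokens)  # update predictor state after each emission
    return tokens
```

The loop consumes frames one at a time and never looks ahead, which is why this family of models suits streaming. The predictor only ever sees emitted tokens, never audio, so it can be pretrained on text alone.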

HuBERT

[2106.07447] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

CTC

Decoding

Greedy
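
A minimal sketch of greedy decoding, assuming this heading refers to CTC output: take the best label per frame, collapse consecutive repeats, then drop blanks. The function name, the list-of-lists input and blank at index 0 are assumptions for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, blank=0):
    """Best-path CTC decoding over per-frame scores (probabilities or logits)."""
    # 1. Pick the highest-scoring label independently for each frame.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # 2. Collapse runs of the same label into a single label.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks; a blank between two equal labels keeps them distinct.
    return [label for label in collapsed if label != blank]
```

The order of steps 2 and 3 matters: removing blanks first would merge genuinely repeated tokens that were separated only by a blank.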

Beam Search
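
A rough sketch of CTC prefix beam search, again assuming the CTC setting. Each prefix tracks two probabilities (ending in blank vs. ending in a non-blank) so that different alignments of the same text are summed. It works in the plain probability domain for readability; a real decoder would likely use log probabilities and often a language model. The function name and parameters are hypothetical.

```python
from collections import defaultdict


def ctc_prefix_beam_search(probs, beam_width=3, blank=0):
    """Return (best_prefix, probability) for per-frame label probabilities."""
    # prefix -> (prob of alignments ending in blank, prob ending in non-blank)
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for label, p in enumerate(frame):
                if label == blank:
                    # Blank never changes the prefix.
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and label == prefix[-1]:
                    # Repeat with no blank in between collapses into the same prefix.
                    next_beams[prefix][1] += p_nb * p
                    # Repeat after a blank is a genuinely new token.
                    next_beams[prefix + (label,)][1] += p_b * p
                else:
                    next_beams[prefix + (label,)][1] += (p_b + p_nb) * p
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_width]}
    best_prefix, scores = max(beams.items(), key=lambda kv: sum(kv[1]))
    return list(best_prefix), sum(scores)
```

With two frames of `[0.6, 0.4]` (blank, token 1), picking the best label per frame yields the empty string, since blank wins both frames, while the search above prefers `[1]` because three alignments (blank-1, 1-blank, 1-1) all collapse to it.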

Conformer

[2005.08100] Conformer: Convolution-augmented Transformer for Speech Recognition

Whisper

Mamba

TTS - Text to Speech

TTS Arena - a Hugging Face Space by TTS-AGI
GitHub - fishaudio/fish-speech: SOTA Open Source TTS
GitHub - edwko/OuteTTS: Interface for OuteTTS models.

F5-TTS

[2410.06885] F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

CosyVoice 2

[2412.10117] CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

End to End Multimodal Speech to Speech

Spirit LM

GitHub - facebookresearch/spiritlm: Inference code for the paper “Spirit-LM Interleaved Spoken and Written Language Model”.
SpiRit-LM, an Interleaved Spoken and Written Language Model | Multimodal Weekly 47 - YouTube

Links

Olewave - YouTube
WAVLab - YouTube
Latest Advancements in Speech Recognition - YouTube
Hearing the AGI from GMM HMM to GPT 4o Yu Zhang - November 15th LTI Colloquium Speaker - Yu Zhang - YouTube

