Speech Recognition on LRS3-TED

Metric: Word Error Rate (WER) (lower is better)
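
As a rough illustration of the metric, WER is commonly computed as the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the hypothesis, divided by the number of reference words. The sketch below is a minimal version under stated assumptions: the function name `word_error_rate` is hypothetical, tokenization is plain whitespace splitting, and the result is returned as a percentage on the assumption that the leaderboard values are percentages. The listed papers may apply their own text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage (assumed to match the leaderboard's scale)."""
    ref = reference.split()  # assumption: whitespace tokenization, no normalization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.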

Results

#	Model	Word Error Rate (WER)	Extra Data	Paper	Date	Code
1	Whisper	0.68	Yes	Whisper-Flamingo: Integrating Visual Features in...	2024-06-14	Code
2	Llama-AVSR	0.81	Yes	Large Language Models are Strong Audio-Visual Sp...	2024-09-18	Code
3	CTC/Attention	1	No	Auto-AVSR: Audio-Visual Speech Recognition with ...	2023-03-25	Code
4	AV-HuBERT Large	1.3	Yes	Learning Audio-Visual Speech Representation by M...	2022-01-05	Code
5	RAVEn Large	1.4	Yes	Jointly Learning Visual and Auditory Speech Repr...	2022-12-12	Code
6	CTC/Attention	19.1	Yes	Auto-AVSR: Audio-Visual Speech Recognition with ...	2023-03-25	Code
7	VTP with more data	30.7	Yes	Sub-word Level Lip Reading With Visual Attention	2021-10-14	-
8	VTP	40.6	Yes	Sub-word Level Lip Reading With Visual Attention	2021-10-14	-

#1 Whisper (SOTA)
WER: 0.68 · Extra Data · 2024-06-14
Paper: Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation (Code)

#2 Llama-AVSR
WER: 0.81 · Extra Data · 2024-09-18
Paper: Large Language Models are Strong Audio-Visual Speech Recognition Learners (Code)

#3 CTC/Attention (SOTA)
WER: 1 · 2023-03-25
Paper: Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels (Code)

#4 AV-HuBERT Large (SOTA)
WER: 1.3 · Extra Data · 2022-01-05
Paper: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction (Code)

#5 RAVEn Large
WER: 1.4 · Extra Data · 2022-12-12
Paper: Jointly Learning Visual and Auditory Speech Representations from Raw Data (Code)

#6 CTC/Attention
WER: 19.1 · Extra Data · 2023-03-25
Paper: Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels (Code)

#7 VTP with more data (SOTA)
WER: 30.7 · Extra Data · 2021-10-14
Paper: Sub-word Level Lip Reading With Visual Attention

#8 VTP (SOTA)
WER: 40.6 · Extra Data · 2021-10-14
Paper: Sub-word Level Lip Reading With Visual Attention
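
The SOTA markers on the entries above are not explained on the page. One reading that happens to reproduce them, offered only as an assumption, is that an entry is marked when its WER is lower than that of every entry with a strictly earlier date. The sketch below checks that reading against the rows listed here; the function name `sota_at_submission` and the rule itself are hypothetical, not the site's documented logic.

```python
from datetime import date

# (model, WER, date) rows copied from the leaderboard above.
ROWS = [
    ("Whisper", 0.68, date(2024, 6, 14)),
    ("Llama-AVSR", 0.81, date(2024, 9, 18)),
    ("CTC/Attention", 1.0, date(2023, 3, 25)),
    ("AV-HuBERT Large", 1.3, date(2022, 1, 5)),
    ("RAVEn Large", 1.4, date(2022, 12, 12)),
    ("CTC/Attention", 19.1, date(2023, 3, 25)),
    ("VTP with more data", 30.7, date(2021, 10, 14)),
    ("VTP", 40.6, date(2021, 10, 14)),
]


def sota_at_submission(rows):
    """Return (model, WER) pairs that beat every strictly earlier entry.

    Assumed rule: entries sharing a date do not compete with each other.
    """
    flagged = []
    for model, wer, day in rows:
        earlier = [w for _, w, d in rows if d < day]
        if not earlier or wer < min(earlier):
            flagged.append((model, wer))
    return flagged


if __name__ == "__main__":
    for model, wer in sota_at_submission(ROWS):
        print(f"{model}: {wer}")
```

Under this assumed rule the flagged entries are the 0.68, 1, 1.3, 30.7, and 40.6 results, which matches the markers shown above.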