Audio-Visual Speech Recognition on LRS3-TED
Metric: Word Error Rate (WER) (lower is better)
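For readers unfamiliar with the metric, here is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration. The papers listed here may apply their own text normalization before scoring, and the leaderboard figures appear to be percentages, whereas this sketch returns a fraction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.4f} ({100 * wer:.2f}%)")
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the values reported in the results.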
Results
| # | Model | Word Error Rate (WER) | Extra Data | SOTA | Paper | Date |
|---|---|---|---|---|---|---|
| 1 | MMS-LLaMA | 0.74 | Yes | SOTA | MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens | 2025-03-14 |
| 2 | Whisper-Flamingo | 0.76 | Yes | SOTA | Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | 2024-06-14 |
| 3 | Llama-AVSR | 0.77 | Yes | | Large Language Models are Strong Audio-Visual Speech Recognition Learners | 2024-09-18 |
| 4 | CTC/Attention | 0.9 | Yes | SOTA | Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | 2023-03-25 |
| 5 | DistillAV | 1.3 | Yes | | Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | 2025-02-09 |
| 6 | AV-HuBERT Large | 1.4 | Yes | SOTA | Robust Self-Supervised Audio-Visual Speech Recognition | 2022-01-05 |
| 7 | RAVEn Large | 1.4 | Yes | | Jointly Learning Visual and Auditory Speech Representations from Raw Data | 2022-12-12 |
| 8 | Zero-AVSR | 1.5 | Yes | | Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations | 2025-03-08 |
| 9 | Hyb-Conformer | 2.3 | No | SOTA | End-to-end Audio-visual Speech Recognition with Conformers | 2021-02-12 |
| 10 | RNN-T | 4.5 | Yes | SOTA | Recurrent Neural Network Transducer for Audio-Visual Speech Recognition | 2019-11-08 |
| 11 | EG-seq2seq | 6.8 | Yes | | Discriminative Multi-modality Speech Recognition | 2020-05-12 |
| 12 | TM-seq2seq | 7.2 | Yes | SOTA | Deep Audio-Visual Speech Recognition | 2018-09-06 |