Audio-Visual Speech Recognition on LRS3-TED
Metric: Word Error Rate (WER) (lower is better)
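The page does not say how WER is computed for these entries. As a minimal sketch, assuming the standard definition (word-level edit distance between reference and hypothesis, divided by the number of reference words) and assuming the scores are reported as percentages, the metric could be computed as below. The function name `word_error_rate` and the whitespace tokenization are illustrative choices, not taken from any listed paper. Real evaluations usually also normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length * 100.

    Assumption: words are split on whitespace with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words. Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.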
Results
| # | Model | Word Error Rate (WER) | Extra Data | Paper | Date | Code |
|---|-------|-----------------------|------------|-------|------|------|
| 1 | MMS-LLaMA | 0.74 | Yes | MMS-LLaMA: Efficient LLM-based Audio-Visual Spee... | 2025-03-14 | Code |
| 2 | Whisper-Flamingo | 0.76 | Yes | Whisper-Flamingo: Integrating Visual Features in... | 2024-06-14 | Code |
| 3 | Llama-AVSR | 0.77 | Yes | Large Language Models are Strong Audio-Visual Sp... | 2024-09-18 | Code |
| 4 | CTC/Attention | 0.9 | Yes | Auto-AVSR: Audio-Visual Speech Recognition with ... | 2023-03-25 | Code |
| 5 | DistillAV | 1.3 | Yes | Audio-Visual Representation Learning via Knowled... | 2025-02-09 | Code |
| 6 | AV-HuBERT Large | 1.4 | Yes | Robust Self-Supervised Audio-Visual Speech Recog... | 2022-01-05 | Code |
| 7 | RAVEn Large | 1.4 | Yes | Jointly Learning Visual and Auditory Speech Repr... | 2022-12-12 | Code |
| 8 | Zero-AVSR | 1.5 | Yes | Zero-AVSR: Zero-Shot Audio-Visual Speech Recogni... | 2025-03-08 | Code |
| 9 | Hyb-Conformer | 2.3 | No | End-to-end Audio-visual Speech Recognition with ... | 2021-02-12 | Code |
| 10 | RNN-T | 4.5 | Yes | Recurrent Neural Network Transducer for Audio-Vi... | 2019-11-08 | Code |
| 11 | EG-seq2seq | 6.8 | Yes | Discriminative Multi-modality Speech Recognition | 2020-05-12 | Code |
| 12 | TM-seq2seq | 7.2 | Yes | Deep Audio-Visual Speech Recognition | 2018-09-06 | Code |