Lipreading on LRS3-TED
Metric: Word Error Rate (WER) (lower is better)
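WER is conventionally computed as the word-level edit distance between the reference transcript and the model's hypothesis (substitutions + deletions + insertions), divided by the number of reference words, and is reported here as a percentage. The sketch below is a minimal illustration of that definition, assuming plain whitespace tokenization and no text normalization; the papers listed on this leaderboard may normalize case and punctuation differently. The function name `word_error_rate` is hypothetical and not taken from any listed codebase.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: 100 * (S + D + I) / N.

    Assumption: words are separated by whitespace and compared exactly
    (no lowercasing or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.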
Results
| # | Model | WER (%) | Extra Data | SOTA | Paper | Date | Code |
|---|-------|---------|------------|------|-------|------|------|
| 1 | LP + Conformer | 12.8 | Yes | Yes | Conformers are All You Need for Visual Speech Recognition | 2023-02-17 | - |
| 2 | Auto-AVSR | 19.1 | Yes | - | Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | 2023-03-25 | Yes |
| 3 | SyncVSR | 21.5 | Yes | - | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | 2024-06-18 | Yes |
| 4 | USR (self + semi-supervised) | 21.5 | Yes | - | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | 2024-11-04 | Yes |
| 5 | USR (self-supervised) | 22.3 | Yes | - | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | 2024-11-04 | Yes |
| 6 | RAVEn Large | 23.4 | Yes | Yes | Jointly Learning Visual and Auditory Speech Representations from Raw Data | 2022-12-12 | Yes |
| 7 | VSP-LLM | 25.4 | Yes | - | Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | 2024-02-23 | Yes |
| 8 | AV-HuBERT Large + Relaxed Attention + LM | 25.51 | Yes | Yes | Relaxed Attention for Transformer Models | 2022-09-20 | Yes |
| 9 | DistillAV | 26.2 | Yes | - | Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | 2025-02-09 | Yes |
| 10 | AV-HuBERT Large | 26.9 | Yes | Yes | Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction | 2022-01-05 | Yes |
| 11 | VTP (more data) | 30.7 | Yes | Yes | Sub-word Level Lip Reading With Visual Attention | 2021-10-14 | - |
| 12 | SyncVSR | 31.2 | No | - | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | 2024-06-18 | Yes |
| 13 | CTC/Attention (LRW+LRS2/3+AVSpeech) | 31.5 | Yes | - | Visual Speech Recognition for Multiple Languages in the Wild | 2022-02-26 | Yes |
| 14 | RNN-T | 33.6 | Yes | Yes | Recurrent Neural Network Transducer for Audio-Visual Speech Recognition | 2019-11-08 | Yes |
| 15 | ES³ Large | 37.1 | No | - | No paper | - | - |
| 16 | ES³ Base | 40.3 | No | - | No paper | - | - |
| 17 | VTP | 40.6 | Yes | - | Sub-word Level Lip Reading With Visual Attention | 2021-10-14 | - |
| 18 | Hyb + Conformer | 43.3 | Yes | - | End-to-end Audio-visual Speech Recognition with Conformers | 2021-02-12 | Yes |
| 19 | CTC-V2P | 55.1 | Yes | Yes | Large-Scale Visual Speech Recognition | 2018-07-13 | - |
| 20 | EG-seq2seq | 57.8 | No | - | Discriminative Multi-modality Speech Recognition | 2020-05-12 | Yes |
| 21 | TM-seq2seq | 58.9 | Yes | - | Deep Audio-Visual Speech Recognition | 2018-09-06 | Yes |
| 22 | CTC + KD | 59.8 | Yes | - | ASR is all you need: cross-modal distillation for lip reading | 2019-11-28 | - |
| 23 | Conv-seq2seq | 60.1 | Yes | - | No paper | - | - |