Natural Language Transduction on LRS2
Metric: Word Error Rate (WER) (lower is better)
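For readers unfamiliar with the metric, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and its reference, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the percentage scaling are assumptions made for illustration. The page does not state how the listed entries normalize text or aggregate over the test set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for one reference/hypothesis pair.

    Assumes whitespace tokenization and no text normalization
    (hypothetical choices; individual papers may differ).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference. Lower values are better, matching the ordering of the table.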
Results
| # | Model | WER | Extra Data | Paper | Date | Code | SOTA |
|---|---|---|---|---|---|---|---|
| 1 | Auto-AVSR | 14.6 | Yes | Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels | 2023-03-25 | Yes | SOTA |
| 2 | USR | 15.4 | Yes | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | 2024-11-04 | Yes | |
| 3 | SyncVSR | 16.5 | Yes | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | 2024-06-18 | Yes | |
| 4 | RAVEn Large | 18.6 | Yes | Jointly Learning Visual and Auditory Speech Representations from Raw Data | 2022-12-12 | Yes | SOTA |
| 5 | VTP (more data) | 22.6 | Yes | Sub-word Level Lip Reading With Visual Attention | 2021-10-14 | - | SOTA |
| 6 | ES³ Large + extLM | 24.6 | Yes | - | - | - | |
| 7 | CTC/Attention (LRW+LRS2/3+AVSpeech) | 25.5 | Yes | Visual Speech Recognition for Multiple Languages in the Wild | 2022-02-26 | Yes | |
| 8 | ES³ Large | 26.7 | Yes | - | - | - | |
| 9 | ES³ Base + extLM | 28.7 | Yes | - | - | - | |
| 10 | VTP | 28.9 | Yes | Sub-word Level Lip Reading With Visual Attention | 2021-10-14 | - | SOTA |
| 11 | SyncVSR | 28.9 | No | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | 2024-06-18 | Yes | |
| 12 | ES³ Base* + extLM | 29.3 | No | - | - | - | |
| 13 | ES³ Base | 30.7 | Yes | - | - | - | |
| 14 | ES³ Base* | 31.4 | No | - | - | - | |
| 15 | CTC/Attention | 32.9 | No | Visual Speech Recognition for Multiple Languages in the Wild | 2022-02-26 | Yes | |
| 16 | Hybrid CTC / Attention | 39.1 | No | End-to-end Audio-visual Speech Recognition with Conformers | 2021-02-12 | Yes | SOTA |
| 17 | MoCo + wav2vec (w/o extLM) | 43.2 | No | Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition | 2022-02-24 | Yes | |
| 18 | Multi-head Visual-Audio Memory | 44.5 | Yes | Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading | 2022-04-04 | Yes | |
| 19 | TM-seq2seq + extLM | 48.3 | Yes | Deep Audio-Visual Speech Recognition | 2018-09-06 | Yes | SOTA |
| 20 | LF-MMI TDNN | 48.86 | Yes | Audio-visual Recognition of Overlapped speech for the LRS2 dataset | 2020-01-06 | - | |
| 21 | Hybrid CTC / Attention | 50 | No | Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture | 2018-09-28 | - | |
| 22 | Conv-seq2seq | 51.7 | Yes | - | - | - | |
| 23 | CTC + KD ASR | 53.2 | Yes | ASR is all you need: cross-modal distillation for lip reading | 2019-11-28 | - | |
| 24 | TM-CTC + extLM | 54.7 | Yes | Deep Audio-Visual Speech Recognition | 2018-09-06 | Yes | SOTA |
| 25 | LIBS | 65.29 | No | Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers | 2019-11-26 | Yes | |
| 26 | SyncVSR | 74.6 | No | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | 2024-06-18 | Yes | |