AV-HuBERT Large

Reported on 4 benchmarks across 4 tasks · 2 papers · 4 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Audio · 1 result

Speech Recognition on LRS3-TED
Word Error Rate (WER) · uses extra data · 2022-01-05
1.3
best: 0.68 (Whisper)
SOTA
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction arXiv:2201.02184

Speech · 1 result

Audio-Visual Speech Recognition on LRS3-TED
Word Error Rate (WER) · uses extra data · 2022-01-05
1.4
best: 0.74 (MMS-LLaMA)
SOTA
Robust Self-Supervised Audio-Visual Speech Recognition arXiv:2201.01763

Computer Vision · 1 result

Lipreading on LRS3-TED
Word Error Rate (WER) · uses extra data · 2022-01-05
26.9
best: 12.8 (LP + Conformer)
SOTA
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction arXiv:2201.02184

Natural Language Processing · 1 result

Natural Language Transduction on LRS3-TED
Word Error Rate (WER) · uses extra data · 2022-01-05
26.9
best: 12.8 (LP + Conformer)
SOTA
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction arXiv:2201.02184