DistillAV

Reported on 5 benchmarks across 5 tasks · 1 paper · 2 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Speech · 2 results

Automatic Speech Recognition (ASR) on LRS3-TED
WER · 2025-02-09
1.4
SOTA
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models arXiv:2502.05766
Audio-Visual Speech Recognition on LRS3-TED
Word Error Rate (WER) · uses extra data · 2025-02-09
1.3
best: 0.74 (MMS-LLaMA)
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models arXiv:2502.05766

Audio · 1 result

Speech Recognition on LRS3-TED
WER · 2025-02-09
1.4
SOTA
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models arXiv:2502.05766

Computer Vision · 1 result

Lipreading on LRS3-TED
Word Error Rate (WER) · uses extra data · 2025-02-09
26.2
best: 12.8 (LP + Conformer)
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models arXiv:2502.05766

Natural Language Processing · 1 result

Natural Language Transduction on LRS3-TED
Word Error Rate (WER) · uses extra data · 2025-02-09
26.2
best: 12.8 (LP + Conformer)
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models arXiv:2502.05766