Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic

Published: 2023-03-25
Tasks: Speech Recognition · Automatic Speech Recognition (ASR) · Audio-Visual Speech Recognition · Visual Speech Recognition · Lipreading
Links: Paper · PDF · Code (official) · Code

Abstract

Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence, in this work, we investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use publicly-available pre-trained ASR models to automatically transcribe unlabelled datasets such as AVSpeech and VoxCeleb2. Then, we train ASR, VSR and AV-ASR models on the augmented training set, which consists of the LRS2 and LRS3 datasets as well as the additional automatically-transcribed data. We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions. The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of 0.9% on LRS3, a relative improvement of 30% over the current state-of-the-art approach, and outperforms methods that have been trained on non-publicly available datasets with 26 times more training data.
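The automatic-labelling step described in the abstract can be sketched in a few lines: run a publicly available pre-trained ASR model over the unlabelled clips, store the resulting transcripts as pseudo-labels, and merge them with the manually labelled data. The sketch below is illustrative only, not the authors' code; it assumes a directory of unlabelled .mp4 clips and uses OpenAI's Whisper purely as a stand-in for whichever public ASR model is used as the transcriber.

```python
# Illustrative pseudo-labelling sketch (assumed setup, not the paper's implementation):
# transcribe every unlabelled clip with a pre-trained ASR model and write
# (path, transcript) pairs that can be appended to the labelled training lists.
from pathlib import Path
import csv
import whisper  # pip install openai-whisper


def auto_transcribe(unlabelled_dir: str, out_csv: str, model_name: str = "base") -> None:
    """Write pseudo-labels for every .mp4 clip in `unlabelled_dir` to `out_csv`."""
    model = whisper.load_model(model_name)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "transcript"])
        for clip in sorted(Path(unlabelled_dir).glob("*.mp4")):
            result = model.transcribe(str(clip))  # returns a dict with a "text" field
            writer.writerow([str(clip), result["text"].strip()])


# Example (hypothetical paths): pseudo-label VoxCeleb2 clips, then train on the
# union of LRS2/LRS3 and the automatically transcribed data.
# auto_transcribe("data/voxceleb2/clips", "data/voxceleb2/pseudo_labels.csv")
```

In the paper, the clips transcribed this way (from AVSpeech and VoxCeleb2) are combined with the labelled LRS2 and LRS3 datasets to form the augmented training set on which the ASR, VSR, and AV-ASR models are trained.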

Results

Task                               | Dataset  | Metric                | Value | Model
Speech Recognition                 | LRS2     | Test WER              | 1.5   | CTC/Attention
Speech Recognition                 | LRS3-TED | Word Error Rate (WER) | 1     | CTC/Attention
Speech Recognition                 | LRS3-TED | Word Error Rate (WER) | 19.1  | CTC/Attention
Audio-Visual Speech Recognition    | LRS3-TED | Word Error Rate (WER) | 0.9   | CTC/Attention
Audio-Visual Speech Recognition    | LRS2     | Test WER              | 1.5   | CTC/Attention
Lipreading                         | LRS2     | Word Error Rate (WER) | 14.6  | Auto-AVSR
Lipreading                         | LRS3-TED | Word Error Rate (WER) | 19.1  | Auto-AVSR
Natural Language Transduction      | LRS2     | Word Error Rate (WER) | 14.6  | Auto-AVSR
Natural Language Transduction      | LRS3-TED | Word Error Rate (WER) | 19.1  | Auto-AVSR
Visual Speech Recognition          | LRS3-TED | Word Error Rate (WER) | 19.1  | CTC/Attention
Automatic Speech Recognition (ASR) | LRS2     | Test WER              | 1.5   | CTC/Attention
Automatic Speech Recognition (ASR) | LRS3-TED | Word Error Rate (WER) | 1     | CTC/Attention

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)
AUTOMATIC PRONUNCIATION MISTAKE DETECTOR PROJECT REPORT (2025-06-25)