Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic

Published: 2023-03-25
Tasks: Speech Recognition · Automatic Speech Recognition (ASR) · Audio-Visual Speech Recognition · Visual Speech Recognition · Lipreading
Links: Paper · PDF · Code (official) · Code

Abstract

Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence, in this work, we investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use publicly-available pre-trained ASR models to automatically transcribe unlabelled datasets such as AVSpeech and VoxCeleb2. Then, we train ASR, VSR and AV-ASR models on the augmented training set, which consists of the LRS2 and LRS3 datasets as well as the additional automatically-transcribed data. We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions. The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of 0.9% on LRS3, a relative improvement of 30% over the current state-of-the-art approach, and outperforms methods that have been trained on non-publicly available datasets with 26 times more training data.
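The automatic-labelling step described in the abstract can be sketched in a few lines: run a publicly available pre-trained ASR model over the unlabelled clips, store the resulting transcripts as pseudo-labels, and merge them with the manually labelled data. The sketch below is illustrative only, not the authors' code; it assumes a directory of unlabelled .mp4 clips and uses OpenAI's Whisper purely as a stand-in for whichever public ASR model is used as the transcriber.

```python
# Illustrative pseudo-labelling sketch (assumed setup, not the paper's implementation):
# transcribe every unlabelled clip with a pre-trained ASR model and write
# (path, transcript) pairs that can be appended to the labelled training lists.
from pathlib import Path
import csv
import whisper  # pip install openai-whisper


def auto_transcribe(unlabelled_dir: str, out_csv: str, model_name: str = "base") -> None:
    """Write pseudo-labels for every .mp4 clip in `unlabelled_dir` to `out_csv`."""
    model = whisper.load_model(model_name)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "transcript"])
        for clip in sorted(Path(unlabelled_dir).glob("*.mp4")):
            result = model.transcribe(str(clip))  # returns a dict with a "text" field
            writer.writerow([str(clip), result["text"].strip()])


# Example (hypothetical paths): pseudo-label VoxCeleb2 clips, then train on the
# union of LRS2/LRS3 and the automatically transcribed data.
# auto_transcribe("data/voxceleb2/clips", "data/voxceleb2/pseudo_labels.csv")
```

In the paper, the clips transcribed this way (from AVSpeech and VoxCeleb2) are combined with the labelled LRS2 and LRS3 datasets to form the augmented training set on which the ASR, VSR, and AV-ASR models are trained.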

Results

Task                               | Dataset  | Metric                | Value | Model
Speech Recognition                 | LRS2     | Test WER              | 1.5   | CTC/Attention
Speech Recognition                 | LRS3-TED | Word Error Rate (WER) | 1     | CTC/Attention
Speech Recognition                 | LRS3-TED | Word Error Rate (WER) | 19.1  | CTC/Attention
Audio-Visual Speech Recognition    | LRS3-TED | Word Error Rate (WER) | 0.9   | CTC/Attention
Audio-Visual Speech Recognition    | LRS2     | Test WER              | 1.5   | CTC/Attention
Lipreading                         | LRS2     | Word Error Rate (WER) | 14.6  | Auto-AVSR
Lipreading                         | LRS3-TED | Word Error Rate (WER) | 19.1  | Auto-AVSR
Natural Language Transduction      | LRS2     | Word Error Rate (WER) | 14.6  | Auto-AVSR
Natural Language Transduction      | LRS3-TED | Word Error Rate (WER) | 19.1  | Auto-AVSR
Visual Speech Recognition          | LRS3-TED | Word Error Rate (WER) | 19.1  | CTC/Attention
Automatic Speech Recognition (ASR) | LRS2     | Test WER              | 1.5   | CTC/Attention
Automatic Speech Recognition (ASR) | LRS3-TED | Word Error Rate (WER) | 1     | CTC/Attention

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)
AUTOMATIC PRONUNCIATION MISTAKE DETECTOR PROJECT REPORT (2025-06-25)