Visual Speech Recognition for Multiple Languages in the Wild

Pingchuan Ma, Stavros Petridis, Maja Pantic

2022-02-26 · Speech Recognition · Hyperparameter Optimization · Visual Speech Recognition · Lipreading
Paper · PDF · Code (official) · Code

Abstract

Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to larger training sets rather than to model design. Here we demonstrate that designing better models is as important as using larger training sets. We propose adding prediction-based auxiliary tasks to a VSR model, and highlight the importance of hyperparameter optimization and appropriate data augmentation. We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models trained on non-publicly available datasets containing up to 21 times as much data. We show, furthermore, that using additional training data, even from other languages or with automatically generated transcriptions, yields further improvements.
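
The results below credit a hybrid "CTC/Attention" model, i.e. one trained with a weighted sum of a CTC loss on the encoder output and a cross-entropy loss on the attention decoder output. The following is a minimal sketch of such a joint objective in PyTorch; it is not the paper's training code, and the vocabulary size, loss weight, and padding conventions are assumptions (the paper's prediction-based auxiliary losses would enter the sum as additional terms):

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 40  # token inventory size: an assumption for illustration

class HybridCTCAttentionLoss(nn.Module):
    """Standard hybrid CTC/attention objective:
    loss = w * CTC + (1 - w) * attention cross-entropy."""

    def __init__(self, blank: int = 0, ctc_weight: float = 0.1):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks padding

    def forward(self, ctc_log_probs, att_logits, targets,
                input_lengths, target_lengths):
        # ctc_log_probs: (T, B, V) log-softmax outputs of the encoder head
        # att_logits:    (B, U, V) decoder logits aligned with the targets
        # targets:       (B, U) token ids, padded with -1
        loss_ctc = self.ctc(ctc_log_probs, targets.clamp(min=0),
                            input_lengths, target_lengths)
        loss_att = self.ce(att_logits.transpose(1, 2), targets)
        return self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att

# toy shapes: T=50 video frames, B=2 clips, U=10 target tokens
loss_fn = HybridCTCAttentionLoss(ctc_weight=0.1)
log_probs = torch.randn(50, 2, VOCAB_SIZE).log_softmax(dim=-1)
att_logits = torch.randn(2, 10, VOCAB_SIZE)
targets = torch.randint(1, VOCAB_SIZE, (2, 10))
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)
loss = loss_fn(log_probs, att_logits, targets, input_lengths, target_lengths)
```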

Results

Task                          | Dataset                    | Metric                | Value | Model
Lipreading                    | LRS2                       | Word Error Rate (WER) | 25.5  | CTC/Attention (LRW+LRS2/3+AVSpeech)
Lipreading                    | LRS2                       | Word Error Rate (WER) | 32.9  | CTC/Attention
Lipreading                    | LRS3-TED                   | Word Error Rate (WER) | 31.5  | CTC/Attention (LRW+LRS2/3+AVSpeech)
Lipreading                    | GRID corpus (mixed-speech) | Word Error Rate (WER) | 1.2   | CTC/Attention
Natural Language Transduction | LRS2                       | Word Error Rate (WER) | 25.5  | CTC/Attention (LRW+LRS2/3+AVSpeech)
Natural Language Transduction | LRS2                       | Word Error Rate (WER) | 32.9  | CTC/Attention
Natural Language Transduction | LRS3-TED                   | Word Error Rate (WER) | 31.5  | CTC/Attention (LRW+LRS2/3+AVSpeech)
Natural Language Transduction | GRID corpus (mixed-speech) | Word Error Rate (WER) | 1.2   | CTC/Attention
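
All values above are word error rates: the word-level edit distance between hypothesis and reference, normalized by the reference length (lower is better). A minimal reference implementation of the standard definition, not tied to this paper's evaluation scripts:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / max(len(ref), 1)

# one substituted word out of four -> 25.0 (%) WER
print(100 * word_error_rate("the cat sat down", "the cat stood down"))
```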

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)