End-to-end Audio-visual Speech Recognition with Conformers

Pingchuan Ma, Stavros Petridis, Maja Pantic

2021-02-12

Speech Recognition, Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition, Visual Speech Recognition, Lipreading, Lip Reading, Language Modelling


Abstract

In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented transformer (Conformer) that can be trained in an end-to-end manner. In particular, the visual and audio encoders learn to extract features directly from raw pixels and audio waveforms, respectively. These features are fed to conformers, and fusion then takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that three choices significantly improve the performance of our model: end-to-end training in place of the pre-computed visual features that are common in the literature, a conformer in place of a recurrent network, and a transformer-based language model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3). The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
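
The abstract says the model is trained with a combination of CTC and an attention mechanism. In the hybrid CTC/Attention framework this is commonly written as a weighted sum of the two losses. The form below is a sketch under that assumption. The symbols x (input sequence), y (target character sequence), and the weight α are illustrative names that do not appear on this page, and the value of α is not stated here.

```latex
\mathcal{L}(x, y) = \alpha \, \mathcal{L}_{\mathrm{CTC}}(x, y) + (1 - \alpha) \, \mathcal{L}_{\mathrm{att}}(x, y),
\qquad 0 \le \alpha \le 1
```

With α = 1 the objective reduces to CTC alone, and with α = 0 to the attention decoder's loss alone.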

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	LRS2	Test WER	3.9	End2end Conformer
Audio-Visual Speech Recognition	LRS3-TED	Word Error Rate (WER)	2.3	Hyb-Conformer
Audio-Visual Speech Recognition	LRS2	Test WER	3.7	End2end Conformer
Lipreading	LRS2	Word Error Rate (WER)	39.1	Hybrid CTC / Attention
Lipreading	LRS3-TED	Word Error Rate (WER)	43.3	Hyb + Conformer
Natural Language Transduction	LRS2	Word Error Rate (WER)	39.1	Hybrid CTC / Attention
Natural Language Transduction	LRS3-TED	Word Error Rate (WER)	43.3	Hyb + Conformer
Automatic Speech Recognition (ASR)	LRS2	Test WER	3.9	End2end Conformer
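
Every row above reports word error rate (WER). As a reminder of what the metric measures, here is a minimal sketch of the usual definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenisation, and the percentage scaling are assumptions for illustration. This is not the authors' evaluation script, and the page does not state whether any text normalisation was applied before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using whitespace tokenisation (an assumption)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors while the denominator is the reference length, WER computed this way can exceed 100 when the hypothesis is much longer than the reference.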
