Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


End-to-end Audio-visual Speech Recognition with Conformers

Pingchuan Ma, Stavros Petridis, Maja Pantic

Published: 2021-02-12

Tasks: Speech Recognition, Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition, Visual Speech Recognition, Lipreading, Language Modelling

Abstract

In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer), which can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw audio waveforms and raw pixels, respectively; these features are fed to conformers, and fusion then takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that end-to-end training (instead of the pre-computed visual features common in the literature), the use of a conformer (instead of a recurrent network), and the use of a transformer-based language model significantly improve the performance of our model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3). The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
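The two core ideas in the abstract — per-frame MLP fusion of the audio and visual encoder outputs, and training with a weighted combination of CTC and attention losses — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensions, the two-layer MLP shape, and the CTC weight of 0.1 are all illustrative assumptions, and random arrays stand in for the conformer encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: T frames, per-modality feature width, fused width.
T, d_audio, d_video, d_model = 50, 256, 256, 256

# Stand-ins for the conformer encoder outputs (one feature vector per frame).
audio_feats = rng.standard_normal((T, d_audio))
video_feats = rng.standard_normal((T, d_video))

def mlp_fusion(a, v, w1, b1, w2, b2):
    """Concatenate modalities per frame, then apply a 2-layer MLP."""
    x = np.concatenate([a, v], axis=-1)   # (T, d_audio + d_video)
    h = np.maximum(x @ w1 + b1, 0.0)      # ReLU hidden layer
    return h @ w2 + b2                    # (T, d_model)

# Randomly initialised MLP weights (hidden width 512 is an assumption).
w1 = rng.standard_normal((d_audio + d_video, 512)) * 0.01
b1 = np.zeros(512)
w2 = rng.standard_normal((512, d_model)) * 0.01
b2 = np.zeros(d_model)

fused = mlp_fusion(audio_feats, video_feats, w1, b1, w2, b2)
print(fused.shape)  # (50, 256)

def hybrid_loss(ctc_loss, att_loss, ctc_weight=0.1):
    """Hybrid CTC/attention objective: a convex combination of the two
    losses. The weight 0.1 here is only a placeholder value."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss

print(hybrid_loss(2.0, 1.0, ctc_weight=0.5))  # 1.5
```

The fused per-frame features would then feed a decoder trained with this combined objective; at inference, CTC and attention scores are typically combined again during beam search.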

Results

Task                               | Dataset  | Metric                | Value | Model
Speech Recognition                 | LRS2     | Test WER              | 3.9   | End2end Conformer
Audio-Visual Speech Recognition    | LRS3-TED | Word Error Rate (WER) | 2.3   | Hyb-Conformer
Audio-Visual Speech Recognition    | LRS2     | Test WER              | 3.7   | End2end Conformer
Lipreading                         | LRS2     | Word Error Rate (WER) | 39.1  | Hybrid CTC / Attention
Lipreading                         | LRS3-TED | Word Error Rate (WER) | 43.3  | Hyb + Conformer
Natural Language Transduction      | LRS2     | Word Error Rate (WER) | 39.1  | Hybrid CTC / Attention
Natural Language Transduction      | LRS3-TED | Word Error Rate (WER) | 43.3  | Hyb + Conformer
Automatic Speech Recognition (ASR) | LRS2     | Test WER              | 3.9   | End2end Conformer

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)