Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Deep Audio-Visual Speech Recognition

Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

Published: 2018-09-06

Tasks: Speech Recognition, Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition, Visual Speech Recognition, Lip Reading

Abstract

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss. Both models are built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. The models that we train surpass the performance of all previous work on a lip reading benchmark dataset by a significant margin.
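The two model families in contribution (1) share an encoder but differ in how the transcript loss is attached: the CTC variant predicts per-frame character posteriors trained with CTC, while the seq2seq variant autoregressively decodes characters with cross-attention over the encoder output. The sketch below contrasts the two losses in PyTorch. It is a minimal illustration only, not the authors' implementation: the dimensions, layer counts, and toy batch are placeholders, and the visual/audio frontends, <sos>/shift-by-one teacher-forcing setup, decoding, and external language model are all omitted.

```python
import torch
import torch.nn as nn

VOCAB = 40    # placeholder character vocabulary size
D_MODEL = 512 # placeholder model width

# Shared self-attention encoder over (audio and/or visual) input features.
enc_layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=6)

# CTC head: per-frame character posteriors, trained with CTC loss.
ctc_head = nn.Linear(D_MODEL, VOCAB + 1)  # +1 output for the CTC blank symbol
ctc_loss = nn.CTCLoss(blank=VOCAB)

# seq2seq head: a transformer decoder attends to the encoder output.
dec_layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
seq2seq_head = nn.Linear(D_MODEL, VOCAB)
ce_loss = nn.CrossEntropyLoss()

# Toy batch: 2 clips of 75 feature frames, target transcripts of 20 characters.
feats = torch.randn(2, 75, D_MODEL)
targets = torch.randint(0, VOCAB, (2, 20))
memory = encoder(feats)

# CTC branch: log-probs shaped (T, B, C) plus input/target lengths.
log_probs = ctc_head(memory).log_softmax(-1).transpose(0, 1)
loss_ctc = ctc_loss(log_probs, targets,
                    input_lengths=torch.full((2,), 75),
                    target_lengths=torch.full((2,), 20))

# seq2seq branch: teacher forcing with a causal mask over the target sequence
# (the usual <sos>/shift-by-one handling is omitted for brevity).
tgt_emb = nn.Embedding(VOCAB, D_MODEL)(targets)
causal = torch.triu(torch.full((20, 20), float("-inf")), diagonal=1)
dec_out = decoder(tgt_emb, memory, tgt_mask=causal)
loss_s2s = ce_loss(seq2seq_head(dec_out).reshape(-1, VOCAB), targets.reshape(-1))

print(float(loss_ctc), float(loss_s2s))
```

One practical consequence of this difference, reflected in the results below, is that the CTC model emits frame-synchronous predictions while the seq2seq model conditions each character on the full decoded prefix.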

Results

All values are word error rates in percent (lower is better).

| Task                               | Dataset  | Metric                | Value | Model              |
|------------------------------------|----------|-----------------------|-------|--------------------|
| Speech Recognition                 | LRS2     | Test WER              | 9.7   | TM-seq2seq         |
| Speech Recognition                 | LRS2     | Test WER              | 10.1  | TM-CTC             |
| Audio-Visual Speech Recognition    | LRS3-TED | Word Error Rate (WER) | 7.2   | TM-seq2seq         |
| Audio-Visual Speech Recognition    | LRS2     | Test WER              | 8.2   | TM-CTC             |
| Audio-Visual Speech Recognition    | LRS2     | Test WER              | 8.5   | TM-seq2seq         |
| Lipreading                         | LRS2     | Word Error Rate (WER) | 48.3  | TM-seq2seq + extLM |
| Lipreading                         | LRS2     | Word Error Rate (WER) | 54.7  | TM-CTC + extLM     |
| Lipreading                         | LRS3-TED | Word Error Rate (WER) | 58.9  | TM-seq2seq         |
| Natural Language Transduction      | LRS2     | Word Error Rate (WER) | 48.3  | TM-seq2seq + extLM |
| Natural Language Transduction      | LRS2     | Word Error Rate (WER) | 54.7  | TM-CTC + extLM     |
| Natural Language Transduction      | LRS3-TED | Word Error Rate (WER) | 58.9  | TM-seq2seq         |
| Automatic Speech Recognition (ASR) | LRS2     | Test WER              | 9.7   | TM-seq2seq         |
| Automatic Speech Recognition (ASR) | LRS2     | Test WER              | 10.1  | TM-CTC             |
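For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis and the reference transcript, divided by the number of reference words. A minimal self-contained sketch of the metric, not the evaluation code used by the paper or this site:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, assuming a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between the current ref prefix and hyp[:j].
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev_diag, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                          dp[j - 1] + 1,     # insertion
                                          prev_diag + cost)  # substitution/match
    return 100.0 * dp[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion -> 16.67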

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)
AUTOMATIC PRONUNCIATION MISTAKE DETECTOR PROJECT REPORT (2025-06-25)