Visual Speech Recognition for Multiple Languages in the Wild

Pingchuan Ma, Stavros Petridis, Maja Pantic

2022-02-26 · Speech Recognition · Hyperparameter Optimization · Visual Speech Recognition · Lipreading
Paper · PDF · Code (official) · Code

Abstract

Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to larger training sets rather than to model design. Here we demonstrate that designing better models is as important as using larger training sets. We propose adding prediction-based auxiliary tasks to a VSR model, and highlight the importance of hyperparameter optimization and appropriate data augmentation. We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models trained on non-publicly available datasets containing up to 21 times as much data. We show, furthermore, that using additional training data, even from other languages or with automatically generated transcriptions, yields further improvements.
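
The results below credit a hybrid "CTC/Attention" model, i.e. one trained with a weighted sum of a CTC loss on the encoder output and a cross-entropy loss on the attention decoder output. The following is a minimal sketch of such a joint objective in PyTorch; it is not the paper's training code, and the vocabulary size, loss weight, and padding conventions are assumptions (the paper's prediction-based auxiliary losses would enter the sum as additional terms):

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 40  # token inventory size: an assumption for illustration

class HybridCTCAttentionLoss(nn.Module):
    """Standard hybrid CTC/attention objective:
    loss = w * CTC + (1 - w) * attention cross-entropy."""

    def __init__(self, blank: int = 0, ctc_weight: float = 0.1):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks padding

    def forward(self, ctc_log_probs, att_logits, targets,
                input_lengths, target_lengths):
        # ctc_log_probs: (T, B, V) log-softmax outputs of the encoder head
        # att_logits:    (B, U, V) decoder logits aligned with the targets
        # targets:       (B, U) token ids, padded with -1
        loss_ctc = self.ctc(ctc_log_probs, targets.clamp(min=0),
                            input_lengths, target_lengths)
        loss_att = self.ce(att_logits.transpose(1, 2), targets)
        return self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att

# toy shapes: T=50 video frames, B=2 clips, U=10 target tokens
loss_fn = HybridCTCAttentionLoss(ctc_weight=0.1)
log_probs = torch.randn(50, 2, VOCAB_SIZE).log_softmax(dim=-1)
att_logits = torch.randn(2, 10, VOCAB_SIZE)
targets = torch.randint(1, VOCAB_SIZE, (2, 10))
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)
loss = loss_fn(log_probs, att_logits, targets, input_lengths, target_lengths)
```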

Results

Task                          | Dataset                    | Metric                | Value | Model
Lipreading                    | LRS2                       | Word Error Rate (WER) | 25.5  | CTC/Attention (LRW+LRS2/3+AVSpeech)
Lipreading                    | LRS2                       | Word Error Rate (WER) | 32.9  | CTC/Attention
Lipreading                    | LRS3-TED                   | Word Error Rate (WER) | 31.5  | CTC/Attention (LRW+LRS2/3+AVSpeech)
Lipreading                    | GRID corpus (mixed-speech) | Word Error Rate (WER) | 1.2   | CTC/Attention
Natural Language Transduction | LRS2                       | Word Error Rate (WER) | 25.5  | CTC/Attention (LRW+LRS2/3+AVSpeech)
Natural Language Transduction | LRS2                       | Word Error Rate (WER) | 32.9  | CTC/Attention
Natural Language Transduction | LRS3-TED                   | Word Error Rate (WER) | 31.5  | CTC/Attention (LRW+LRS2/3+AVSpeech)
Natural Language Transduction | GRID corpus (mixed-speech) | Word Error Rate (WER) | 1.2   | CTC/Attention
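
All values above are word error rates: the word-level edit distance between hypothesis and reference, normalized by the reference length (lower is better). A minimal reference implementation of the standard definition, not tied to this paper's evaluation scripts:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / #reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / max(len(ref), 1)

# one substituted word out of four -> 25.0 (%) WER
print(100 * word_error_rate("the cat sat down", "the cat stood down"))
```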

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)