
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization

Young Jin Ahn, Jungwoo Park, Sangha Park, Jonghyun Choi, Kee-Eung Kim

2024-06-18. Tasks: Speech Recognition, Landmark-based Lipreading, Visual Speech Recognition, Lipreading

Abstract

Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes: visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
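
The frame-level supervision described in the abstract can be sketched roughly as follows. This is a minimal, framework-free illustration under stated assumptions, not the authors' implementation: the function names (`project`, `cross_entropy`, `audio_token_loss`), the weight layout, the vocabulary size, and the number of audio tokens per video frame are all hypothetical, since the abstract does not specify them. Each video frame's feature vector is projected to logits over a discrete audio vocabulary, one set of logits per audio token aligned with that frame, and a cross-entropy loss against the quantized audio tokens is averaged over all frames. Every frame is scored independently, which is what makes the prediction non-autoregressive.

```python
import math
import random

def project(feature, weights, bias):
    """Linear projection from one visual feature to logits for each audio-token slot.

    Assumed (hypothetical) layout:
      weights: [tokens_per_frame][vocab_size][feature_dim]
      bias:    [tokens_per_frame][vocab_size]
    """
    return [
        [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(slot_w, slot_b)]
        for slot_w, slot_b in zip(weights, bias)
    ]

def cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single classification target."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def audio_token_loss(frame_features, audio_tokens, weights, bias):
    """Mean cross-entropy between predicted and quantized audio tokens.

    frame_features: [num_frames][feature_dim] outputs of a visual encoder.
    audio_tokens:   [num_frames][tokens_per_frame] discrete audio targets.
    Frames are scored independently of one another (non-autoregressive).
    """
    total, count = 0.0, 0
    for feature, targets in zip(frame_features, audio_tokens):
        for slot_logits, target in zip(project(feature, weights, bias), targets):
            total += cross_entropy(slot_logits, target)
            count += 1
    return total / count

# Toy example with made-up sizes; real systems would use learned weights.
random.seed(0)
num_frames, feature_dim, tokens_per_frame, vocab_size = 3, 4, 2, 5
features = [[random.gauss(0, 1) for _ in range(feature_dim)] for _ in range(num_frames)]
tokens = [[random.randrange(vocab_size) for _ in range(tokens_per_frame)]
          for _ in range(num_frames)]
weights = [[[0.0] * feature_dim for _ in range(vocab_size)] for _ in range(tokens_per_frame)]
bias = [[0.0] * vocab_size for _ in range(tokens_per_frame)]

# With all-zero weights every prediction is uniform, so the loss is log(vocab_size).
loss = audio_token_loss(features, tokens, weights, bias)
```

In a training setup, an auxiliary term like this would presumably be added to the main recognition loss; the abstract does not state how the two are weighted, so no combination is shown here.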

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Lipreading | CAS-VSR-W1k (LRW-1000) | Top-1 Accuracy | 58.2 | SyncVSR (Word Boundary) |
| Lipreading | LRS2 | Word Error Rate (WER) | 16.5 | SyncVSR |
| Lipreading | LRS2 | Word Error Rate (WER) | 28.9 | SyncVSR |
| Lipreading | Lip Reading in the Wild | Top-1 Accuracy | 95 | SyncVSR (Word Boundary) |
| Lipreading | Lip Reading in the Wild | Top-1 Accuracy | 93.2 | SyncVSR |
| Lipreading | LRS3-TED | Word Error Rate (WER) | 21.5 | SyncVSR |
| Lipreading | LRS3-TED | Word Error Rate (WER) | 31.2 | SyncVSR |
| Lipreading | LRW | Top 1 Accuracy | 80.3 | SyncVSR (Word Boundary) |
| Lipreading | LRW | Top 1 Accuracy | 75.1 | SyncVSR |
| Lipreading | LRS2 | Word Error Rate (WER) | 74.6 | SyncVSR |
| Natural Language Transduction | CAS-VSR-W1k (LRW-1000) | Top-1 Accuracy | 58.2 | SyncVSR (Word Boundary) |
| Natural Language Transduction | LRS2 | Word Error Rate (WER) | 16.5 | SyncVSR |
| Natural Language Transduction | LRS2 | Word Error Rate (WER) | 28.9 | SyncVSR |
| Natural Language Transduction | Lip Reading in the Wild | Top-1 Accuracy | 95 | SyncVSR (Word Boundary) |
| Natural Language Transduction | Lip Reading in the Wild | Top-1 Accuracy | 93.2 | SyncVSR |
| Natural Language Transduction | LRS3-TED | Word Error Rate (WER) | 21.5 | SyncVSR |
| Natural Language Transduction | LRS3-TED | Word Error Rate (WER) | 31.2 | SyncVSR |
| Natural Language Transduction | LRW | Top 1 Accuracy | 80.3 | SyncVSR (Word Boundary) |
| Natural Language Transduction | LRW | Top 1 Accuracy | 75.1 | SyncVSR |
| Natural Language Transduction | LRS2 | Word Error Rate (WER) | 74.6 | SyncVSR |
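
Several of the results above are reported as Word Error Rate (WER), where lower is better. As a reminder of what the metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and a reference transcript, divided by the number of reference words and expressed as a percentage. The function name `word_error_rate` and the example sentences are illustrative only; the paper's evaluation code may normalize text (casing, punctuation) differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One substituted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.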
