Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C. V. Jawahar

2020-05-17 · CVPR 2020
Tasks: Speech Synthesis · Lip Reading · Speaker-Specific Lip to Speech Synthesis · Lip to Speech Synthesis
Paper · PDF · Code (official)

Abstract

Humans involuntarily tend to infer parts of a conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip-sequence-to-speech mappings for individual speakers in unconstrained, large-vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically to train and evaluate the single-speaker lip to speech task in natural settings. We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation using quantitative and qualitative metrics and human evaluation shows that our method is four times more intelligible than previous works in this space. Please check out our demo video for a quick overview of the paper, method, and qualitative results. https://www.youtube.com/watch?v=HziA-jmlk_4&feature=youtu.be

Results

Model: Lip2Wav. The same results are reported under three task leaderboards: Speech Recognition, Visual Speech Recognition, and Lip to Speech Synthesis. STOI, ESTOI, and PESQ are higher-is-better intelligibility/quality metrics.

Dataset                         | STOI  | ESTOI | PESQ
--------------------------------|-------|-------|------
LRW                             | 0.543 | 0.344 | 1.197
Lip2Wav (Chem)                  | 0.416 | 0.284 | 1.3
Lip2Wav (Chess)                 | 0.418 | 0.29  | 1.4
Lip2Wav (DL)                    | 0.282 | 0.183 | 1.671
Lip2Wav (EH)                    | 0.369 | 0.22  | 1.367
Lip2Wav (HS)                    | 0.446 | 0.311 | 1.29
TCD-TIMIT corpus (mixed-speech) | 0.558 | 0.365 | 1.35
GRID corpus (mixed-speech)      | 0.731 | 0.535 | 1.772

Lip Reading (WER, lower is better):

Dataset                         | WER
--------------------------------|------
GRID corpus (mixed-speech)      | 14.08
TCD-TIMIT corpus (mixed-speech) | 31.26
LRW                             | 34.2

Related Papers

NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech — 2025-07-17
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis — 2025-07-08
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis — 2025-07-08
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting — 2025-07-06
DeepGesture: A conversational gesture synthesis system based on emotions and semantics — 2025-07-03
OpusLM: A Family of Open Unified Speech Language Models — 2025-06-21
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching — 2025-06-20
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems — 2025-06-19