Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C. V. Jawahar

2020-05-17 | CVPR 2020 | Speech Synthesis, Lip Reading, Speaker-Specific Lip to Speech Synthesis, Lip to Speech Synthesis

Abstract

Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically to train and evaluate the single-speaker lip to speech task in natural settings. We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation using quantitative, qualitative metrics and human evaluation shows that our method is four times more intelligible than previous works in this space. Please check out our demo video for a quick overview of the paper, method, and qualitative results. https://www.youtube.com/watch?v=HziA-jmlk_4&feature=youtu.be

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	LRW	ESTOI	0.344	Lip2Wav
Speech Recognition	LRW	PESQ	1.197	Lip2Wav
Speech Recognition	LRW	STOI	0.543	Lip2Wav
Speech Recognition	Lip2Wav (EH)	ESTOI	0.22	Lip2Wav
Speech Recognition	Lip2Wav (EH)	PESQ	1.367	Lip2Wav
Speech Recognition	Lip2Wav (EH)	STOI	0.369	Lip2Wav
Speech Recognition	Lip2Wav (Chess)	ESTOI	0.29	Lip2Wav
Speech Recognition	Lip2Wav (Chess)	PESQ	1.4	Lip2Wav
Speech Recognition	Lip2Wav (Chess)	STOI	0.418	Lip2Wav
Speech Recognition	Lip2Wav (DL)	ESTOI	0.183	Lip2Wav
Speech Recognition	Lip2Wav (DL)	PESQ	1.671	Lip2Wav
Speech Recognition	Lip2Wav (DL)	STOI	0.282	Lip2Wav
Speech Recognition	Lip2Wav (HS)	ESTOI	0.311	Lip2Wav
Speech Recognition	Lip2Wav (HS)	PESQ	1.29	Lip2Wav
Speech Recognition	Lip2Wav (HS)	STOI	0.446	Lip2Wav
Speech Recognition	Lip2Wav (Chem)	ESTOI	0.284	Lip2Wav
Speech Recognition	Lip2Wav (Chem)	PESQ	1.3	Lip2Wav
Speech Recognition	Lip2Wav (Chem)	STOI	0.416	Lip2Wav
Speech Recognition	TCD-TIMIT corpus (mixed-speech)	ESTOI	0.365	Lip2Wav
Speech Recognition	TCD-TIMIT corpus (mixed-speech)	PESQ	1.35	Lip2Wav
Speech Recognition	TCD-TIMIT corpus (mixed-speech)	STOI	0.558	Lip2Wav
Speech Recognition	GRID corpus (mixed-speech)	ESTOI	0.535	Lip2Wav
Speech Recognition	GRID corpus (mixed-speech)	PESQ	1.772	Lip2Wav
Speech Recognition	GRID corpus (mixed-speech)	STOI	0.731	Lip2Wav
Visual Speech Recognition	LRW	ESTOI	0.344	Lip2Wav
Visual Speech Recognition	LRW	PESQ	1.197	Lip2Wav
Visual Speech Recognition	LRW	STOI	0.543	Lip2Wav
Visual Speech Recognition	Lip2Wav (EH)	ESTOI	0.22	Lip2Wav
Visual Speech Recognition	Lip2Wav (EH)	PESQ	1.367	Lip2Wav
Visual Speech Recognition	Lip2Wav (EH)	STOI	0.369	Lip2Wav
Visual Speech Recognition	Lip2Wav (Chess)	ESTOI	0.29	Lip2Wav
Visual Speech Recognition	Lip2Wav (Chess)	PESQ	1.4	Lip2Wav
Visual Speech Recognition	Lip2Wav (Chess)	STOI	0.418	Lip2Wav
Visual Speech Recognition	Lip2Wav (DL)	ESTOI	0.183	Lip2Wav
Visual Speech Recognition	Lip2Wav (DL)	PESQ	1.671	Lip2Wav
Visual Speech Recognition	Lip2Wav (DL)	STOI	0.282	Lip2Wav
Visual Speech Recognition	Lip2Wav (HS)	ESTOI	0.311	Lip2Wav
Visual Speech Recognition	Lip2Wav (HS)	PESQ	1.29	Lip2Wav
Visual Speech Recognition	Lip2Wav (HS)	STOI	0.446	Lip2Wav
Visual Speech Recognition	Lip2Wav (Chem)	ESTOI	0.284	Lip2Wav
Visual Speech Recognition	Lip2Wav (Chem)	PESQ	1.3	Lip2Wav
Visual Speech Recognition	Lip2Wav (Chem)	STOI	0.416	Lip2Wav
Visual Speech Recognition	TCD-TIMIT corpus (mixed-speech)	ESTOI	0.365	Lip2Wav
Visual Speech Recognition	TCD-TIMIT corpus (mixed-speech)	PESQ	1.35	Lip2Wav
Visual Speech Recognition	TCD-TIMIT corpus (mixed-speech)	STOI	0.558	Lip2Wav
Visual Speech Recognition	GRID corpus (mixed-speech)	ESTOI	0.535	Lip2Wav
Visual Speech Recognition	GRID corpus (mixed-speech)	PESQ	1.772	Lip2Wav
Visual Speech Recognition	GRID corpus (mixed-speech)	STOI	0.731	Lip2Wav
Lip to Speech Synthesis	LRW	ESTOI	0.344	Lip2Wav
Lip to Speech Synthesis	LRW	PESQ	1.197	Lip2Wav
Lip to Speech Synthesis	LRW	STOI	0.543	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (EH)	ESTOI	0.22	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (EH)	PESQ	1.367	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (EH)	STOI	0.369	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (Chess)	ESTOI	0.29	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (Chess)	PESQ	1.4	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (Chess)	STOI	0.418	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (DL)	ESTOI	0.183	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (DL)	PESQ	1.671	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (DL)	STOI	0.282	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (HS)	ESTOI	0.311	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (HS)	PESQ	1.29	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (HS)	STOI	0.446	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (Chem)	ESTOI	0.284	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (Chem)	PESQ	1.3	Lip2Wav
Lip to Speech Synthesis	Lip2Wav (Chem)	STOI	0.416	Lip2Wav
Lip to Speech Synthesis	TCD-TIMIT corpus (mixed-speech)	ESTOI	0.365	Lip2Wav
Lip to Speech Synthesis	TCD-TIMIT corpus (mixed-speech)	PESQ	1.35	Lip2Wav
Lip to Speech Synthesis	TCD-TIMIT corpus (mixed-speech)	STOI	0.558	Lip2Wav
Lip to Speech Synthesis	GRID corpus (mixed-speech)	ESTOI	0.535	Lip2Wav
Lip to Speech Synthesis	GRID corpus (mixed-speech)	PESQ	1.772	Lip2Wav
Lip to Speech Synthesis	GRID corpus (mixed-speech)	STOI	0.731	Lip2Wav
Lip Reading	TCD-TIMIT corpus (mixed-speech)	WER	31.26	Lip2Wav
Lip Reading	GRID corpus (mixed-speech)	WER	14.08	Lip2Wav
Lip Reading	LRW	WER	34.2	Lip2Wav
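
The WER rows above are word error rates, which appear to be reported in percent. As a rough illustration of what the metric measures, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `word_error_rate` and the example sentences are hypothetical, and the page does not say how the hypothesis transcripts were obtained from the generated speech.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# Hypothetical example sentences: one substitution and one deletion
# against a six-word reference.
example_wer = word_error_rate("place blue at f two now", "place red at f two")
print(f"WER: {example_wer:.2f}%")
```

Lower is better for WER, whereas higher is better for STOI, ESTOI, and PESQ, so the WER rows should be read in the opposite direction from the rest of the table.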
