Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

2020-10-12 · NeurIPS 2020 · Speech Synthesis

Abstract

Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method achieves quality comparable to human speech while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN for mel-spectrogram inversion of unseen speakers and for end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with quality comparable to an autoregressive counterpart.
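The abstract's claim that modeling periodic patterns is crucial corresponds, in the paper, to discriminators that look at equally spaced samples of the waveform. The core trick can be sketched as reshaping a 1-D waveform into a 2-D array whose columns contain samples one period apart, so a 2-D discriminator can inspect each periodic component. This is a minimal NumPy illustration; the function name and padding choice are ours, not the authors' code.

```python
import numpy as np

def reshape_by_period(wave, period):
    # Pad the 1-D waveform so its length is a multiple of `period`,
    # then view it as a 2-D (frames x period) array. Samples that are
    # `period` steps apart in the original signal end up in the same
    # column, exposing periodic structure to a 2-D discriminator.
    pad = (-len(wave)) % period
    padded = np.pad(wave, (0, pad))
    return padded.reshape(-1, period)

x = np.arange(10, dtype=np.float32)
print(reshape_by_period(x, 4).shape)  # (3, 4)
```

In the paper this reshaping is applied for several prime periods in parallel, so that different periodic components of the audio are each examined by a dedicated sub-discriminator.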

Results

Task | Dataset | Metric | Value | Model
Speech Synthesis | LibriTTS | M-STFT | 1.0017 | HiFi-GAN
Speech Synthesis | LibriTTS | MCD | 0.6603 | HiFi-GAN
Speech Synthesis | LibriTTS | PESQ | 2.947 | HiFi-GAN
Speech Synthesis | LibriTTS | Periodicity | 0.1565 | HiFi-GAN
Speech Synthesis | LibriTTS | V/UV F1 | 0.93 | HiFi-GAN
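Of the metrics in the table, MCD (mel-cepstral distortion) measures the spectral distance in dB between reference and synthesized speech. A commonly used definition can be computed as below; this is a generic sketch, and the exact MCD variant behind the reported value (coefficient count, frame alignment) is not specified here.

```python
import numpy as np

def mcd(c_ref, c_syn):
    # Mel-cepstral distortion in dB between two aligned sequences of
    # mel-cepstral coefficient vectors, shape (frames, dims).
    # Uses the common 10 / ln(10) * sqrt(2 * sum of squared diffs)
    # per-frame formula, averaged over frames.
    diff = np.asarray(c_ref) - np.asarray(c_syn)
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(per_frame)

# Identical sequences give zero distortion.
print(mcd(np.ones((5, 13)), np.ones((5, 13))))  # 0.0
```

Lower MCD is better; likewise lower M-STFT and Periodicity error, and higher PESQ and V/UV F1, indicate closer agreement with the reference audio.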

Related Papers

- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
- A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
- DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
- OpusLM: A Family of Open Unified Speech Language Models (2025-06-21)
- RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching (2025-06-20)
- InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems (2025-06-19)
- An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW (2025-06-18)