Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

2020-10-12 · NeurIPS 2020 · Speech Synthesis

Abstract

Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method achieves quality comparable to human speech while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN for mel-spectrogram inversion of unseen speakers and for end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with quality comparable to an autoregressive counterpart.
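The abstract's claim that modeling periodic patterns is crucial corresponds, in the paper, to discriminators that look at equally spaced samples of the waveform. The core trick can be sketched as reshaping a 1-D waveform into a 2-D array whose columns contain samples one period apart, so a 2-D discriminator can inspect each periodic component. This is a minimal NumPy illustration; the function name and padding choice are ours, not the authors' code.

```python
import numpy as np

def reshape_by_period(wave, period):
    # Pad the 1-D waveform so its length is a multiple of `period`,
    # then view it as a 2-D (frames x period) array. Samples that are
    # `period` steps apart in the original signal end up in the same
    # column, exposing periodic structure to a 2-D discriminator.
    pad = (-len(wave)) % period
    padded = np.pad(wave, (0, pad))
    return padded.reshape(-1, period)

x = np.arange(10, dtype=np.float32)
print(reshape_by_period(x, 4).shape)  # (3, 4)
```

In the paper this reshaping is applied for several prime periods in parallel, so that different periodic components of the audio are each examined by a dedicated sub-discriminator.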

Results

Task | Dataset | Metric | Value | Model
Speech Synthesis | LibriTTS | M-STFT | 1.0017 | HiFi-GAN
Speech Synthesis | LibriTTS | MCD | 0.6603 | HiFi-GAN
Speech Synthesis | LibriTTS | PESQ | 2.947 | HiFi-GAN
Speech Synthesis | LibriTTS | Periodicity | 0.1565 | HiFi-GAN
Speech Synthesis | LibriTTS | V/UV F1 | 0.93 | HiFi-GAN
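Of the metrics in the table, MCD (mel-cepstral distortion) measures the spectral distance in dB between reference and synthesized speech. A commonly used definition can be computed as below; this is a generic sketch, and the exact MCD variant behind the reported value (coefficient count, frame alignment) is not specified here.

```python
import numpy as np

def mcd(c_ref, c_syn):
    # Mel-cepstral distortion in dB between two aligned sequences of
    # mel-cepstral coefficient vectors, shape (frames, dims).
    # Uses the common 10 / ln(10) * sqrt(2 * sum of squared diffs)
    # per-frame formula, averaged over frames.
    diff = np.asarray(c_ref) - np.asarray(c_syn)
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(per_frame)

# Identical sequences give zero distortion.
print(mcd(np.ones((5, 13)), np.ones((5, 13))))  # 0.0
```

Lower MCD is better; likewise lower M-STFT and Periodicity error, and higher PESQ and V/UV F1, indicate closer agreement with the reference audio.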

Related Papers

- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
- A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
- DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
- OpusLM: A Family of Open Unified Speech Language Models (2025-06-21)
- RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching (2025-06-20)
- InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems (2025-06-19)
- An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW (2025-06-18)