Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


HiFi-GAN

Speech Synthesis · Introduced 2020 · 33 papers
Source Paper

Description

HiFi-GAN is a generative adversarial network for speech synthesis. HiFi-GAN consists of one generator and two discriminators: a multi-scale and a multi-period discriminator. The generator and discriminators are trained adversarially, along with two additional losses (a mel-spectrogram loss and a feature matching loss) that improve training stability and model performance.

The generator is a fully convolutional neural network. It takes a mel-spectrogram as input and upsamples it through transposed convolutions until the length of the output sequence matches the temporal resolution of raw waveforms. Every transposed convolution is followed by a multi-receptive field fusion (MRF) module, which sums the outputs of several residual blocks with different kernel sizes and dilation rates to observe patterns of various lengths in parallel.
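The upsampling structure described above can be sketched in PyTorch. This is an illustrative toy, not the official implementation: the class name, channel widths, and kernel choices are assumptions, and the MRF modules are omitted; only the upsample rates (8, 8, 2, 2), whose product equals the paper's V1 hop size of 256, follow the source paper.

```python
import torch
import torch.nn as nn

class TinyHiFiGANGeneratorSketch(nn.Module):
    """Toy sketch of the HiFi-GAN generator's upsampling path.

    Each transposed convolution stretches the time axis by its stride; the
    strides multiply to the mel hop size (8 * 8 * 2 * 2 = 256), so T mel
    frames become T * 256 waveform samples. In the real model every stage
    is followed by an MRF module, omitted here for brevity.
    """

    def __init__(self, mel_channels=80, base_channels=128, rates=(8, 8, 2, 2)):
        super().__init__()
        layers = [nn.Conv1d(mel_channels, base_channels, kernel_size=7, padding=3)]
        ch = base_channels
        for r in rates:
            layers += [
                nn.LeakyReLU(0.1),
                # kernel = 2 * stride and padding = stride // 2 keep the
                # output length exactly stride * input length
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r,
                                   stride=r, padding=r // 2),
            ]
            ch //= 2
        layers += [nn.LeakyReLU(0.1),
                   nn.Conv1d(ch, 1, kernel_size=7, padding=3),
                   nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):
        return self.net(mel)

gen = TinyHiFiGANGeneratorSketch()
wav = gen(torch.randn(1, 80, 100))  # 100 mel frames
# wav.shape -> torch.Size([1, 1, 25600]), i.e. 100 * 256 samples
```

The key design point is that the network is fully convolutional, so the same weights handle mel inputs of any length.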

For the discriminator, a multi-period discriminator (MPD) is used, consisting of several sub-discriminators that each handle a different periodic component of the input audio by reshaping the 1-D waveform into 2-D with a distinct period. Additionally, to capture consecutive patterns and long-term dependencies, the multi-scale discriminator (MSD) proposed in MelGAN is used, which evaluates the audio at several progressively downsampled resolutions.
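The period-wise reshaping each MPD sub-discriminator performs can be sketched in plain Python. The helper name `to_period_2d` is hypothetical; the prime periods [2, 3, 5, 7, 11] follow the paper, but the paper pads with reflection rather than the zeros used here for brevity.

```python
def to_period_2d(audio, period):
    """Reshape a 1-D waveform into rows of length `period`, so that a
    sub-discriminator sees every `period`-th sample stacked in a column.
    The tail is zero-padded to a whole number of periods (the paper uses
    reflective padding)."""
    pad = (-len(audio)) % period
    audio = list(audio) + [0.0] * pad
    return [audio[i:i + period] for i in range(0, len(audio), period)]

periods = [2, 3, 5, 7, 11]          # prime periods used by the MPD
x = list(range(10))
grid = to_period_2d(x, 3)           # 10 samples -> 4 rows of period 3
# grid -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0.0, 0.0]]
```

Using prime periods keeps the sub-discriminators' views as disjoint as possible, so each one observes a different periodic structure of the waveform.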

Papers Using This Method

- RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer (2025-01-02)
- A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction (2024-12-11)
- TSELM: Target Speaker Extraction using Discrete Tokens and Language Models (2024-09-12)
- DSP-informed bandwidth extension using locally-conditioned excitation and linear time-varying filter subnetworks (2024-07-22)
- StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning (2024-06-05)
- CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations (2024-04-10)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis (2024-01-30)
- UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization (2024-01-26)
- Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages (2024-01-24)
- SELM: Speech Enhancement Using Discrete Tokens and Language Models (2023-12-15)
- APNet2: High-quality and High-efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra (2023-11-20)
- Collaborative Watermarking for Adversarial Speech Synthesis (2023-09-26)
- HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform (2023-09-18)
- Rep2wav: Noise Robust text-to-speech Using self-supervised representations (2023-08-28)
- MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies (2023-08-03)
- Speaker-independent neural formant synthesis (2023-06-02)
- Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis (2023-04-26)
- Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis (2023-03-24)
- Self-Supervised Representations for Singing Voice Conversion (2023-03-21)
- Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages (2023-02-13)