Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

BigVGAN: A Universal Neural Vocoder with Large-Scale Training

Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon

2022-06-09 · Music Generation, Audio Generation, Speech Synthesis, Audio Synthesis

Abstract

Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates a raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well to various out-of-distribution scenarios without fine-tuning. We introduce a periodic activation function and an anti-aliased representation into the GAN generator, which bring the desired inductive bias for audio synthesis and significantly improve audio quality. In addition, we train our GAN vocoder at the largest scale, up to 112M parameters, which is unprecedented in the literature. We identify and address the failure modes in large-scale GAN training for audio, while maintaining high-fidelity output without over-regularization. Our BigVGAN, trained only on clean speech (LibriTTS), achieves state-of-the-art performance for various zero-shot (out-of-distribution) conditions, including unseen speakers, languages, recording environments, singing voices, music, and instrumental audio. We release our code and model at: https://github.com/NVIDIA/BigVGAN
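The abstract does not name the periodic activation. In the BigVGAN paper it is the Snake function, x + (1/α)·sin²(αx), with α learned per channel. The sketch below is a minimal scalar illustration of that formula only: the function names are hypothetical, α is a fixed float rather than a trainable parameter, and the anti-aliasing (low-pass filtered upsampling and downsampling around the activation) is not included.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    The linear term keeps the function monotone for alpha > 0, while the
    sin^2 term adds a component with period pi / alpha.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + (math.sin(alpha * x) ** 2) / alpha


def snake_all(xs, alpha: float = 1.0):
    """Apply snake() element-wise to an iterable of floats (hypothetical helper)."""
    return [snake(x, alpha) for x in xs]


if __name__ == "__main__":
    # A larger alpha gives a higher-frequency, lower-amplitude periodic term.
    for a in (0.5, 1.0, 2.0):
        print(a, snake_all([0.0, 0.5, 1.0, 1.5], alpha=a))
```

Subtracting x from the output leaves only the periodic term, which is the inductive bias toward periodic signals that the abstract refers to.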

Results

Task | Dataset | Metric | Value | Model
Speech Recognition | LibriTTS | M-STFT | 0.7026 | BigVGAN-v2
Speech Recognition | LibriTTS | MCD | 0.2903 | BigVGAN-v2
Speech Recognition | LibriTTS | PESQ | 4.362 | BigVGAN-v2
Speech Recognition | LibriTTS | Periodicity | 0.0593 | BigVGAN-v2
Speech Recognition | LibriTTS | V/UV F1 | 0.9793 | BigVGAN-v2
Speech Recognition | LibriTTS | M-STFT | 0.7997 | BigVGAN
Speech Recognition | LibriTTS | MCD | 0.3745 | BigVGAN
Speech Recognition | LibriTTS | PESQ | 4.027 | BigVGAN
Speech Recognition | LibriTTS | Periodicity | 0.1018 | BigVGAN
Speech Recognition | LibriTTS | V/UV F1 | 0.9598 | BigVGAN
Speech Recognition | LibriTTS | M-STFT | 0.8788 | BigVGAN-base
Speech Recognition | LibriTTS | MCD | 0.4564 | BigVGAN-base
Speech Recognition | LibriTTS | PESQ | 3.519 | BigVGAN-base
Speech Recognition | LibriTTS | Periodicity | 0.1287 | BigVGAN-base
Speech Recognition | LibriTTS | V/UV F1 | 0.9459 | BigVGAN-base
Speech Synthesis | LibriTTS | M-STFT | 0.7026 | BigVGAN-v2
Speech Synthesis | LibriTTS | MCD | 0.2903 | BigVGAN-v2
Speech Synthesis | LibriTTS | PESQ | 4.362 | BigVGAN-v2
Speech Synthesis | LibriTTS | Periodicity | 0.0593 | BigVGAN-v2
Speech Synthesis | LibriTTS | V/UV F1 | 0.9793 | BigVGAN-v2
Speech Synthesis | LibriTTS | M-STFT | 0.7997 | BigVGAN
Speech Synthesis | LibriTTS | MCD | 0.3745 | BigVGAN
Speech Synthesis | LibriTTS | PESQ | 4.027 | BigVGAN
Speech Synthesis | LibriTTS | Periodicity | 0.1018 | BigVGAN
Speech Synthesis | LibriTTS | V/UV F1 | 0.9598 | BigVGAN
Speech Synthesis | LibriTTS | M-STFT | 0.8788 | BigVGAN-base
Speech Synthesis | LibriTTS | MCD | 0.4564 | BigVGAN-base
Speech Synthesis | LibriTTS | PESQ | 3.519 | BigVGAN-base
Speech Synthesis | LibriTTS | Periodicity | 0.1287 | BigVGAN-base
Speech Synthesis | LibriTTS | V/UV F1 | 0.9459 | BigVGAN-base
Accented Speech Recognition | LibriTTS | M-STFT | 0.7026 | BigVGAN-v2
Accented Speech Recognition | LibriTTS | MCD | 0.2903 | BigVGAN-v2
Accented Speech Recognition | LibriTTS | PESQ | 4.362 | BigVGAN-v2
Accented Speech Recognition | LibriTTS | Periodicity | 0.0593 | BigVGAN-v2
Accented Speech Recognition | LibriTTS | V/UV F1 | 0.9793 | BigVGAN-v2
Accented Speech Recognition | LibriTTS | M-STFT | 0.7997 | BigVGAN
Accented Speech Recognition | LibriTTS | MCD | 0.3745 | BigVGAN
Accented Speech Recognition | LibriTTS | PESQ | 4.027 | BigVGAN
Accented Speech Recognition | LibriTTS | Periodicity | 0.1018 | BigVGAN
Accented Speech Recognition | LibriTTS | V/UV F1 | 0.9598 | BigVGAN
Accented Speech Recognition | LibriTTS | M-STFT | 0.8788 | BigVGAN-base
Accented Speech Recognition | LibriTTS | MCD | 0.4564 | BigVGAN-base
Accented Speech Recognition | LibriTTS | PESQ | 3.519 | BigVGAN-base
Accented Speech Recognition | LibriTTS | Periodicity | 0.1287 | BigVGAN-base
Accented Speech Recognition | LibriTTS | V/UV F1 | 0.9459 | BigVGAN-base
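
The V/UV F1 rows report an F1 score for voiced/unvoiced classification. This page does not describe the evaluation pipeline, so the following is only a sketch of the standard F1 formula applied to per-frame voicing flags. It assumes the flags have already been extracted from the reference and generated audio by some pitch tracker, and it treats "voiced" as the positive class; the function name and inputs are hypothetical.

```python
def vuv_f1(reference, predicted):
    """F1 score for voiced/unvoiced decisions, with 'voiced' as the positive class.

    reference, predicted: equal-length sequences of booleans, one per frame,
    where True means the frame is voiced.
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same number of frames")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)        # voiced in both
    fp = sum(1 for r, p in pairs if not r and p)    # voiced only in prediction
    fn = sum(1 for r, p in pairs if r and not p)    # voiced only in reference

    if tp == 0:
        return 0.0  # no correctly detected voiced frames

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    ref = [True, True, False, False]
    gen = [True, False, False, True]
    print(vuv_f1(ref, gen))
```

A score of 1.0 means the generated audio has exactly the same voicing pattern as the reference on every frame.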

Related Papers

NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling (2025-07-14)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling (2025-07-11)
MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation (2025-07-08)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)