Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji

2023-09-06 | Speech Synthesis

Abstract

Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications. Our code is available at https://github.com/sony/bigvsan.
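For readers unfamiliar with the least-squares GAN objective that the abstract says most GAN-based vocoders adopt, the following is a minimal sketch of the standard LSGAN losses, plus a projection of a feature vector onto a unit-norm direction to illustrate what "projection in the feature space" refers to. It is not the modified SAN loss proposed in the paper; for that, see the authors' repository. The function names (`lsgan_discriminator_loss`, `lsgan_generator_loss`, `project_onto_direction`) are hypothetical and chosen for illustration only.

```python
import math
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def lsgan_discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Standard LSGAN discriminator loss: push real scores toward 1 and fake scores toward 0."""
    real_term = _mean([(s - 1.0) ** 2 for s in d_real])
    fake_term = _mean([s ** 2 for s in d_fake])
    return real_term + fake_term


def lsgan_generator_loss(d_fake: Sequence[float]) -> float:
    """Standard LSGAN generator loss: push the discriminator's scores on fake data toward 1."""
    return _mean([(s - 1.0) ** 2 for s in d_fake])


def project_onto_direction(feature: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of a feature vector onto a direction normalized to unit length.

    Illustrative assumption: the discriminator output is an inner product between
    extracted features and a unit-norm direction.
    """
    norm = math.sqrt(sum(w * w for w in direction))
    return sum(f * w / norm for f, w in zip(feature, direction))


if __name__ == "__main__":
    print(lsgan_discriminator_loss([0.9, 1.1], [0.1, -0.1]))
    print(lsgan_generator_loss([0.5, 0.5]))
    print(project_onto_direction([3.0, 4.0], [6.0, 8.0]))
```

In this sketch, a discriminator that scores every real sample as 1 and every fake sample as 0 has zero loss, and the generator loss is zero only when the discriminator scores its outputs as 1.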

Results

Task | Dataset | Metric | Value | Model
Speech Recognition | LibriTTS | M-STFT | 0.7992 | BigVSAN (w/ snakebeta)
Speech Recognition | LibriTTS | MCD | 0.4129 | BigVSAN (w/ snakebeta)
Speech Recognition | LibriTTS | PESQ | 4.12 | BigVSAN (w/ snakebeta)
Speech Recognition | LibriTTS | Periodicity | 0.0924 | BigVSAN (w/ snakebeta)
Speech Recognition | LibriTTS | V/UV F1 | 0.9644 | BigVSAN (w/ snakebeta)
Speech Recognition | LibriTTS | M-STFT | 0.7881 | BigVSAN
Speech Recognition | LibriTTS | MCD | 0.3381 | BigVSAN
Speech Recognition | LibriTTS | PESQ | 4.116 | BigVSAN
Speech Recognition | LibriTTS | Periodicity | 0.0935 | BigVSAN
Speech Recognition | LibriTTS | V/UV F1 | 0.9635 | BigVSAN
Speech Synthesis | LibriTTS | M-STFT | 0.7992 | BigVSAN (w/ snakebeta)
Speech Synthesis | LibriTTS | MCD | 0.4129 | BigVSAN (w/ snakebeta)
Speech Synthesis | LibriTTS | PESQ | 4.12 | BigVSAN (w/ snakebeta)
Speech Synthesis | LibriTTS | Periodicity | 0.0924 | BigVSAN (w/ snakebeta)
Speech Synthesis | LibriTTS | V/UV F1 | 0.9644 | BigVSAN (w/ snakebeta)
Speech Synthesis | LibriTTS | M-STFT | 0.7881 | BigVSAN
Speech Synthesis | LibriTTS | MCD | 0.3381 | BigVSAN
Speech Synthesis | LibriTTS | PESQ | 4.116 | BigVSAN
Speech Synthesis | LibriTTS | Periodicity | 0.0935 | BigVSAN
Speech Synthesis | LibriTTS | V/UV F1 | 0.9635 | BigVSAN
Accented Speech Recognition | LibriTTS | M-STFT | 0.7992 | BigVSAN (w/ snakebeta)
Accented Speech Recognition | LibriTTS | MCD | 0.4129 | BigVSAN (w/ snakebeta)
Accented Speech Recognition | LibriTTS | PESQ | 4.12 | BigVSAN (w/ snakebeta)
Accented Speech Recognition | LibriTTS | Periodicity | 0.0924 | BigVSAN (w/ snakebeta)
Accented Speech Recognition | LibriTTS | V/UV F1 | 0.9644 | BigVSAN (w/ snakebeta)
Accented Speech Recognition | LibriTTS | M-STFT | 0.7881 | BigVSAN
Accented Speech Recognition | LibriTTS | MCD | 0.3381 | BigVSAN
Accented Speech Recognition | LibriTTS | PESQ | 4.116 | BigVSAN
Accented Speech Recognition | LibriTTS | Periodicity | 0.0935 | BigVSAN
Accented Speech Recognition | LibriTTS | V/UV F1 | 0.9635 | BigVSAN

Related Papers

NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
OpusLM: A Family of Open Unified Speech Language Models (2025-06-21)
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching (2025-06-20)
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems (2025-06-19)
An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW (2025-06-18)