BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji

2023-09-06 | Speech Synthesis

Abstract

Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications. Our code is available at https://github.com/sony/bigvsan.
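For readers unfamiliar with the least-squares GAN objective that the abstract refers to, the following is a minimal sketch of its standard form (up to constant factors). It is not the authors' implementation and does not include the modification they propose to satisfy the requirements of SAN. The function names and the example scores are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def lsgan_discriminator_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Standard least-squares discriminator loss.

    Pushes discriminator scores on real samples toward 1
    and scores on generated samples toward 0.
    """
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term


def lsgan_generator_loss(d_fake: Sequence[float]) -> float:
    """Standard least-squares generator loss.

    Pushes discriminator scores on generated samples toward 1.
    """
    return sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)


# Hypothetical discriminator scores for two real and two generated waveforms.
d_real = [1.0, 0.5]
d_fake = [0.0, 0.5]

print(lsgan_discriminator_loss(d_real, d_fake))
print(lsgan_generator_loss(d_fake))
```

In a vocoder these scores would come from the discriminator networks applied to real and synthesized waveforms; plain lists are used here only to keep the sketch self-contained.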

Results

Task	Dataset	Metric	Value	Model
Speech Synthesis	LibriTTS	M-STFT	0.7992	BigVSAN (w/ snakebeta)
Speech Synthesis	LibriTTS	MCD	0.4129	BigVSAN (w/ snakebeta)
Speech Synthesis	LibriTTS	PESQ	4.12	BigVSAN (w/ snakebeta)
Speech Synthesis	LibriTTS	Periodicity	0.0924	BigVSAN (w/ snakebeta)
Speech Synthesis	LibriTTS	V/UV F1	0.9644	BigVSAN (w/ snakebeta)
Speech Synthesis	LibriTTS	M-STFT	0.7881	BigVSAN
Speech Synthesis	LibriTTS	MCD	0.3381	BigVSAN
Speech Synthesis	LibriTTS	PESQ	4.116	BigVSAN
Speech Synthesis	LibriTTS	Periodicity	0.0935	BigVSAN
Speech Synthesis	LibriTTS	V/UV F1	0.9635	BigVSAN
