Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks

Shijia Liao, Shiyi Lan, Arun George Zachariah

2024-01-31 | Audio Generation, Speech Synthesis

Abstract

The advent of Large Models marks a new era in machine learning: they significantly outperform smaller models by leveraging vast datasets to capture and synthesize complex patterns. Despite these advancements, the exploration of scaling, especially in the audio generation domain, remains limited. Previous efforts did not extend into the high-fidelity (HiFi) 44.1kHz domain and suffered from both spectral discontinuities and blurriness in the high-frequency domain, alongside a lack of robustness against out-of-domain data. These limitations restrict the applicability of models to diverse use cases, including music and singing generation. Our work introduces Enhanced Various Audio Generation via Scalable Generative Adversarial Networks (EVA-GAN), which yields significant improvements over the previous state of the art in spectral and high-frequency reconstruction and in robustness on out-of-domain data, enabling the generation of HiFi audio. It does so by employing an extensive dataset of 36,000 hours of 44.1kHz audio, a context-aware module, and a Human-In-The-Loop artifact measurement toolkit, and by expanding the model to approximately 200 million parameters. Demonstrations of our work are available at https://double-blind-eva-gan.cc.

Results

Task                         Dataset   Metric       Value   Model
Speech Recognition           LibriTTS  M-STFT       0.7982  EVA-GAN-big
Speech Recognition           LibriTTS  PESQ         4.3536  EVA-GAN-big
Speech Recognition           LibriTTS  Periodicity  0.0751  EVA-GAN-big
Speech Recognition           LibriTTS  V/UV F1      0.9745  EVA-GAN-big
Speech Recognition           LibriTTS  M-STFT       0.9485  EVA-GAN-base
Speech Recognition           LibriTTS  PESQ         4.033   EVA-GAN-base
Speech Recognition           LibriTTS  Periodicity  0.0942  EVA-GAN-base
Speech Recognition           LibriTTS  V/UV F1      0.9658  EVA-GAN-base
Speech Synthesis             LibriTTS  M-STFT       0.7982  EVA-GAN-big
Speech Synthesis             LibriTTS  PESQ         4.3536  EVA-GAN-big
Speech Synthesis             LibriTTS  Periodicity  0.0751  EVA-GAN-big
Speech Synthesis             LibriTTS  V/UV F1      0.9745  EVA-GAN-big
Speech Synthesis             LibriTTS  M-STFT       0.9485  EVA-GAN-base
Speech Synthesis             LibriTTS  PESQ         4.033   EVA-GAN-base
Speech Synthesis             LibriTTS  Periodicity  0.0942  EVA-GAN-base
Speech Synthesis             LibriTTS  V/UV F1      0.9658  EVA-GAN-base
Accented Speech Recognition  LibriTTS  M-STFT       0.7982  EVA-GAN-big
Accented Speech Recognition  LibriTTS  PESQ         4.3536  EVA-GAN-big
Accented Speech Recognition  LibriTTS  Periodicity  0.0751  EVA-GAN-big
Accented Speech Recognition  LibriTTS  V/UV F1      0.9745  EVA-GAN-big
Accented Speech Recognition  LibriTTS  M-STFT       0.9485  EVA-GAN-base
Accented Speech Recognition  LibriTTS  PESQ         4.033   EVA-GAN-base
Accented Speech Recognition  LibriTTS  Periodicity  0.0942  EVA-GAN-base
Accented Speech Recognition  LibriTTS  V/UV F1      0.9658  EVA-GAN-base
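
To make the gap between the two model sizes easier to read, the LibriTTS numbers listed above can be turned into relative changes. The following is a minimal sketch, not part of the paper: the helper names (`relative_change`, `compare`) are hypothetical, and the direction in which each metric improves (lower for M-STFT and Periodicity, higher for PESQ and V/UV F1) is an assumption based on how these vocoder metrics are usually reported, since this page does not state it.

```python
# Values copied from the LibriTTS results listed on this page.
RESULTS = {
    "EVA-GAN-base": {"M-STFT": 0.9485, "PESQ": 4.033, "Periodicity": 0.0942, "V/UV F1": 0.9658},
    "EVA-GAN-big": {"M-STFT": 0.7982, "PESQ": 4.3536, "Periodicity": 0.0751, "V/UV F1": 0.9745},
}

# Assumption (not stated on this page): which direction counts as better.
LOWER_IS_BETTER = {"M-STFT": True, "PESQ": False, "Periodicity": True, "V/UV F1": False}


def relative_change(base: float, new: float) -> float:
    """Signed relative change from base to new, as a fraction of base."""
    return (new - base) / base


def compare(base_name: str, new_name: str) -> dict:
    """Map each metric to (relative change, whether the change is an improvement)."""
    base, new = RESULTS[base_name], RESULTS[new_name]
    summary = {}
    for metric, base_value in base.items():
        change = relative_change(base_value, new[metric])
        improved = change < 0 if LOWER_IS_BETTER[metric] else change > 0
        summary[metric] = (change, improved)
    return summary


if __name__ == "__main__":
    for metric, (change, improved) in compare("EVA-GAN-base", "EVA-GAN-big").items():
        print(f"{metric:12s} {change:+.1%}  improved={improved}")
```

Under those assumptions, EVA-GAN-big improves on EVA-GAN-base on all four metrics; for example, M-STFT falls by roughly 16% and PESQ rises by roughly 8%.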

Related Papers

NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing (2025-06-26)
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance (2025-06-26)
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation (2025-06-24)