Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak

2023-06-01 · Speech Synthesis · Audio Synthesis

Paper · PDF · Code (official)

Abstract

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representations are an appealing alternative, aligning more closely with human auditory perception and benefiting from well-established fast algorithms for their computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state of the art in audio quality, as demonstrated in our evaluations, but also substantially improves computational efficiency, achieving an order-of-magnitude speedup over prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.
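The abstract's central idea — predicting Fourier spectral coefficients and recovering the waveform with a single inverse STFT, rather than learning a stack of time-domain upsampling layers — can be sketched as below. This is a minimal illustration, not the Vocos implementation: the neural backbone is stubbed out with random magnitude/phase outputs, and the names (`predict_spectral_coeffs`) and hyperparameters (`N_FFT`, `HOP`) are illustrative assumptions.

```python
import numpy as np
from scipy.signal import istft

N_FFT = 1024  # analysis window length (assumed, not from the paper)
HOP = 256     # hop size; 75% overlap with a Hann window

def predict_spectral_coeffs(n_frames, rng):
    """Stand-in for a neural head that emits one magnitude and one
    phase value per frequency bin per STFT frame. In a Fourier-based
    vocoder this is the network's entire output."""
    shape = (N_FFT // 2 + 1, n_frames)  # one-sided spectrum
    mag = rng.random(shape)
    phase = rng.uniform(-np.pi, np.pi, shape)
    return mag, phase

rng = np.random.default_rng(0)
mag, phase = predict_spectral_coeffs(40, rng)

# Combine magnitude and phase into complex STFT coefficients, then
# invert. One ISTFT (overlap-add of FFT frames) replaces the chain of
# transposed convolutions used by time-domain GAN vocoders, which is
# where the computational savings come from.
spec = mag * np.exp(1j * phase)
_, audio = istft(spec, nperseg=N_FFT, noverlap=N_FFT - HOP)
print(audio.shape)  # 1-D waveform
```

Note that the inversion itself is fixed and differentiable; all learning happens in the spectral prediction, which sidesteps the historically difficult step of recovering phase from a magnitude-only spectrogram because the model predicts phase directly.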

Results

Task                       | Dataset  | Metric      | Value  | Model
Speech Recognition         | LibriTTS | PESQ        | 3.7    | Vocos
Speech Recognition         | LibriTTS | Periodicity | 0.101  | Vocos
Speech Recognition         | LibriTTS | V/UV F1     | 0.9582 | Vocos
Speech Synthesis           | LibriTTS | PESQ        | 3.7    | Vocos
Speech Synthesis           | LibriTTS | Periodicity | 0.101  | Vocos
Speech Synthesis           | LibriTTS | V/UV F1     | 0.9582 | Vocos
Accented Speech Recognition | LibriTTS | PESQ        | 3.7    | Vocos
Accented Speech Recognition | LibriTTS | Periodicity | 0.101  | Vocos
Accented Speech Recognition | LibriTTS | V/UV F1     | 0.9582 | Vocos

Related Papers

NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling (2025-07-11)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance (2025-06-26)
OpusLM: A Family of Open Unified Speech Language Models (2025-06-21)
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching (2025-06-20)