Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

WaveFlow: A Compact Flow-based Model for Raw Audio

Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song

2019-12-03 | ICML 2020 | Speech Synthesis

Abstract

In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. It handles the long-range structure of 1-D waveform with a dilated 2-D convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases. It generates high-fidelity speech as WaveNet, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, it can significantly reduce the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters, which is 15$\times$ smaller than WaveGlow. It can generate 22.05 kHz high-fidelity audio 42.6$\times$ faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.
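The throughput and size figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, and the implied WaveGlow parameter count is an approximation derived from the rounded 15$\times$ ratio, not a number stated in the abstract.

```python
# Figures taken from the abstract.
SAMPLE_RATE_HZ = 22_050   # 22.05 kHz output audio
SPEEDUP = 42.6            # synthesis speed relative to real time
PARAMS_M = 5.91           # WaveFlow parameters, in millions
SIZE_RATIO = 15           # "15x smaller than WaveGlow" (rounded)


def synthesis_rate_khz(sample_rate_hz: float, speedup: float) -> float:
    """Samples generated per wall-clock second, expressed in kHz."""
    return sample_rate_hz * speedup / 1000.0


def seconds_to_generate(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to synthesize a clip of the given duration."""
    return audio_seconds / speedup


rate_khz = synthesis_rate_khz(SAMPLE_RATE_HZ, SPEEDUP)

# Assumption: back-computed from the rounded ratio, so only approximate.
implied_waveglow_params_m = PARAMS_M * SIZE_RATIO

print(f"synthesis rate: {rate_khz:.1f} kHz")
print(f"time for 10 s of audio: {seconds_to_generate(10.0, SPEEDUP):.3f} s")
print(f"implied WaveGlow size: ~{implied_waveglow_params_m:.1f}M parameters")
```

The computed rate agrees with the 939.3 kHz reported in the abstract, which confirms that the two speed figures describe the same measurement.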

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Recognition | LibriTTS | M-STFT | 1.112 | WaveFlow |
| Speech Recognition | LibriTTS | MCD | 1.2455 | WaveFlow |
| Speech Recognition | LibriTTS | PESQ | 3.027 | WaveFlow |
| Speech Recognition | LibriTTS | Periodicity | 0.1416 | WaveFlow |
| Speech Recognition | LibriTTS | V/UV F1 | 0.941 | WaveFlow |
| Speech Synthesis | LibriTTS | M-STFT | 1.112 | WaveFlow |
| Speech Synthesis | LibriTTS | MCD | 1.2455 | WaveFlow |
| Speech Synthesis | LibriTTS | PESQ | 3.027 | WaveFlow |
| Speech Synthesis | LibriTTS | Periodicity | 0.1416 | WaveFlow |
| Speech Synthesis | LibriTTS | V/UV F1 | 0.941 | WaveFlow |
| Accented Speech Recognition | LibriTTS | M-STFT | 1.112 | WaveFlow |
| Accented Speech Recognition | LibriTTS | MCD | 1.2455 | WaveFlow |
| Accented Speech Recognition | LibriTTS | PESQ | 3.027 | WaveFlow |
| Accented Speech Recognition | LibriTTS | Periodicity | 0.1416 | WaveFlow |
| Accented Speech Recognition | LibriTTS | V/UV F1 | 0.941 | WaveFlow |

Related Papers

NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
OpusLM: A Family of Open Unified Speech Language Models (2025-06-21)
RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching (2025-06-20)
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems (2025-06-19)
An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW (2025-06-18)