RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction

Peng Liu, Dongyang Dai, Zhiyong Wu

2024-03-08 | Audio Generation | Speech Synthesis

Abstract

Recent advancements in generative modeling have significantly enhanced the reconstruction of audio waveforms from various representations. While diffusion models are adept at this task, they are hindered by latency issues due to their operation at the individual sample point level and the need for numerous sampling steps. In this study, we introduce RFWave, a cutting-edge multi-band Rectified Flow approach designed to reconstruct high-fidelity audio waveforms from Mel-spectrograms or discrete acoustic tokens. RFWave uniquely generates complex spectrograms and operates at the frame level, processing all subbands simultaneously to boost efficiency. Leveraging Rectified Flow, which targets a straight transport trajectory, RFWave achieves reconstruction with just 10 sampling steps. Our empirical evaluations show that RFWave not only provides outstanding reconstruction quality but also offers vastly superior computational efficiency, enabling audio generation at speeds up to 160 times faster than real-time on a GPU. An online demonstration is available at: https://rfwave-demo.github.io/rfwave/.
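The abstract says that Rectified Flow targets a straight transport trajectory and that RFWave reconstructs audio in 10 sampling steps. The sketch below illustrates why a straight trajectory tolerates so few steps, using fixed-step Euler integration of dz/dt = v(z, t) from noise (t = 0) toward a target (t = 1). It is not the authors' implementation. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the closed-form velocity field stands in for the trained network, which in RFWave would be conditioned on Mel-spectrograms or acoustic tokens and would operate on complex-spectrogram subbands.

```python
import numpy as np

def euler_sample(velocity, z0, num_steps=10):
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = np.array(z0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # t takes the values 0, dt, ..., 1 - dt
        z = z + dt * velocity(z, t)
    return z

def make_toy_velocity(target):
    """Hypothetical stand-in for a trained network.

    For a single fixed target, the straight path z_t = (1 - t) * z0 + t * target
    has velocity (target - z_t) / (1 - t), which is what this returns.
    """
    def velocity(z, t):
        return (target - z) / (1.0 - t)
    return velocity

rng = np.random.default_rng(0)
target = rng.standard_normal((4, 8))    # toy "frames x bins" array, not real audio
noise = rng.standard_normal((4, 8))     # starting point at t = 0

out_10 = euler_sample(make_toy_velocity(target), noise, num_steps=10)
out_1 = euler_sample(make_toy_velocity(target), noise, num_steps=1)
print(np.max(np.abs(out_10 - target)), np.max(np.abs(out_1 - target)))
```

In this idealized setting the path is exactly straight, so Euler integration reaches the target up to floating-point error for any step count, including a single step. A learned velocity field is only approximately straight, which is consistent with the abstract reporting a small number of steps (10) rather than one.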

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	LibriTTS	PESQ	4.228	RFWave
Speech Recognition	LibriTTS	Periodicity	0.09	RFWave
Speech Recognition	LibriTTS	V/UV F1	0.968	RFWave
Speech Synthesis	LibriTTS	PESQ	4.228	RFWave
Speech Synthesis	LibriTTS	Periodicity	0.09	RFWave
Speech Synthesis	LibriTTS	V/UV F1	0.968	RFWave
Accented Speech Recognition	LibriTTS	PESQ	4.228	RFWave
Accented Speech Recognition	LibriTTS	Periodicity	0.09	RFWave
Accented Speech Recognition	LibriTTS	V/UV F1	0.968	RFWave
