Sang-Hoon Lee, Ha-Yeong Choi, Seong-Whan Lee
This paper introduces PeriodWave-Turbo, a high-fidelity and highly efficient waveform generation model trained via adversarial flow matching optimization. Recently, conditional flow matching (CFM) generative models have been successfully adopted for waveform generation tasks, leveraging a single vector field estimation objective for training. Although these models can generate high-fidelity waveform signals, they require significantly more ODE steps than GAN-based models, which need only a single generation step. Additionally, the generated samples often lack high-frequency information because the noisy vector field estimate does not ensure accurate high-frequency reproduction. To address these limitations, we enhance pre-trained CFM-based generative models by incorporating a fixed-step generator modification. We utilize reconstruction losses and adversarial feedback to accelerate high-fidelity waveform generation. Through adversarial flow matching optimization, the model requires only 1,000 fine-tuning steps to achieve state-of-the-art performance across various objective metrics. Moreover, we significantly reduce the number of inference steps from 16 to 2 or 4. Additionally, by scaling up the backbone of PeriodWave from 29M to 70M parameters for improved generalization, PeriodWave-Turbo achieves unprecedented performance, with a perceptual evaluation of speech quality (PESQ) score of 4.454 on the LibriTTS dataset. Audio samples, source code, and checkpoints will be available at https://github.com/sh-lee-prml/PeriodWave.
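To make the idea of a fixed-step generator concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain Euler solver and a toy hand-written vector field standing in for the trained network; the names `fixed_step_generator`, `make_toy_field`, and `l1_loss` are hypothetical. It only illustrates that unrolling a fixed number of ODE steps turns the sampler into a deterministic map from noise to output, whose result can then be compared to a reference with a reconstruction loss.

```python
from typing import Callable, List

Vector = List[float]
VectorField = Callable[[Vector, float], Vector]


def fixed_step_generator(v: VectorField, x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 using `num_steps` Euler steps.

    Assumption: Euler is used here only for simplicity; the solver used in
    the paper may differ.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    dt = 1.0 / num_steps
    x = list(x0)
    for k in range(num_steps):
        t = k * dt                      # current time, always < 1
        dx = v(x, t)                    # estimated vector field at (x, t)
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x


def make_toy_field(target: Vector) -> VectorField:
    """Toy stand-in for a trained network (hypothetical).

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the field
    (x1 - x) / (1 - t) points from the current state to the target x1.
    """
    def v(x: Vector, t: float) -> Vector:
        return [(yi - xi) / (1.0 - t) for xi, yi in zip(x, target)]
    return v


def l1_loss(a: Vector, b: Vector) -> float:
    """Mean absolute error, used here as a simple reconstruction loss."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b)) / len(a)


if __name__ == "__main__":
    target = [0.5, -0.25, 1.0]          # stand-in for a reference waveform
    noise = [0.0, 0.0, 0.0]             # stand-in for the initial sample
    field = make_toy_field(target)
    for steps in (2, 4, 16):
        out = fixed_step_generator(field, noise, steps)
        print(steps, l1_loss(out, target))
```

With this idealized field, the step count makes no practical difference to the result. A learned field is noisy, which is why reducing the number of steps typically needs additional fine-tuning, such as the reconstruction and adversarial objectives described in the abstract.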
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Recognition | LibriTTS | M-STFT | 0.7358 | PeriodWave-Turbo-L |
| Speech Recognition | LibriTTS | PESQ | 4.454 | PeriodWave-Turbo-L |
| Speech Recognition | LibriTTS | Periodicity | 0.0528 | PeriodWave-Turbo-L |
| Speech Recognition | LibriTTS | V/UV F1 | 0.9756 | PeriodWave-Turbo-L |
| Speech Synthesis | LibriTTS | M-STFT | 0.7358 | PeriodWave-Turbo-L |
| Speech Synthesis | LibriTTS | PESQ | 4.454 | PeriodWave-Turbo-L |
| Speech Synthesis | LibriTTS | Periodicity | 0.0528 | PeriodWave-Turbo-L |
| Speech Synthesis | LibriTTS | V/UV F1 | 0.9756 | PeriodWave-Turbo-L |
| Accented Speech Recognition | LibriTTS | M-STFT | 0.7358 | PeriodWave-Turbo-L |
| Accented Speech Recognition | LibriTTS | PESQ | 4.454 | PeriodWave-Turbo-L |
| Accented Speech Recognition | LibriTTS | Periodicity | 0.0528 | PeriodWave-Turbo-L |
| Accented Speech Recognition | LibriTTS | V/UV F1 | 0.9756 | PeriodWave-Turbo-L |