OverFlow: Putting flows on top of neural transducers for better TTS

Shivam Mehta, Ambika Kirkland, Harm Lameris, Jonas Beskow, Éva Székely, Gustav Eje Henter

2022-11-13Text to Speech Speech Synthesis Text-To-Speech Synthesis text-to-speech

Abstract

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech. Please see https://shivammehta25.github.io/OverFlow/ for audio examples and code.

Results

Task	Dataset	Metric	Value	Model
Text-To-Speech Synthesis	LJSpeech	Audio Quality MOS	3.37	OverFlow
Text-To-Speech Synthesis	LJSpeech	Word Error Rate (WER)	2.3	OverFlow

Related Papers

Hear Your Code Fail, Voice-Assisted Debugging for Python2025-07-20 NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech2025-07-17 P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge2025-07-15 An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments2025-07-14 ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching2025-07-12 Exploiting Leaderboards for Large-Scale Distribution of Malicious Models2025-07-11 MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling2025-07-11 Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis2025-07-08