A Spectral Energy Distance for Parallel Speech Synthesis

Alexey A. Gritsenko, Tim Salimans, Rianne van den Berg, Jasper Snoek, Nal Kalchbrenner

2020-08-03 | NeurIPS 2020 12 | Speech Synthesis

Abstract

Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently-proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently-proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
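As a rough illustration of how a loss of this kind can be estimated from a minibatch, the sketch below computes a generalized energy distance between magnitude spectrograms. Several details are assumptions made for the example and are not taken from the abstract: the names `magnitude_spectrogram`, `spec_dist`, and `energy_distance_loss` are hypothetical, the spectrogram is a naive framed DFT without a window function, the distance is plain Euclidean, and the model is assumed to provide two independent samples per conditioning input.

```python
import cmath
import math
import random


def magnitude_spectrogram(wave, frame_len=16, hop=8):
    """Magnitudes of a naive framed DFT (no window function, for brevity)."""
    spec = []
    for start in range(0, len(wave) - frame_len + 1, hop):
        frame = wave[start:start + frame_len]
        for k in range(frame_len // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            spec.append(abs(coeff))
    return spec  # flattened: frames x frequency bins


def spec_dist(a, b):
    """Euclidean distance between two flattened magnitude spectrograms."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def energy_distance_loss(real, gen_a, gen_b):
    """Minibatch estimate of 2*E[d(x, y)] - E[d(y, y')].

    real[i] is a real waveform; gen_a[i] and gen_b[i] are assumed to be two
    independent model samples for the same conditioning as real[i]. The
    E[d(x, x')] term of the full energy distance is omitted because it does
    not depend on the model.
    """
    total = 0.0
    for x, y1, y2 in zip(real, gen_a, gen_b):
        sx, s1, s2 = (magnitude_spectrogram(w) for w in (x, y1, y2))
        attract = spec_dist(sx, s1) + spec_dist(sx, s2)  # pull samples toward data
        repel = spec_dist(s1, s2)                        # keep samples diverse
        total += attract - repel
    return total / len(real)


if __name__ == "__main__":
    random.seed(0)
    make = lambda: [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(4)]
    real, gen_a, gen_b = make(), make(), make()
    print(energy_distance_loss(real, gen_a, gen_b))
```

Each term is an average over independent samples, so no discriminator is involved. The repulsive term is what distinguishes this from a plain spectrogram regression loss: a model that always emits the same waveform for a given input gets no credit from it.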
