SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, Yoshua Bengio

2016-12-22
Audio Generation, Speech Synthesis, Temporal Sequences

Abstract

In this paper we propose a novel model for unconditional audio generation that generates one audio sample at a time. We show that our model, which combines memory-less modules (autoregressive multilayer perceptrons) and stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in temporal sequences over very long time spans on three datasets of different nature. Human evaluation of the generated samples indicates that our model is preferred over competing models. We also show how each component of the model contributes to the exhibited performance.
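The hierarchy described in the abstract can be illustrated with a minimal two-tier sketch: a stateful recurrent tier that reads one frame of past samples at a time, and a memory-less MLP tier that emits one quantized sample at a time, conditioned on the recurrent tier's output. This is not the authors' implementation. The quantization level count `Q`, the frame size `FRAME`, the hidden size `HIDDEN`, the plain tanh recurrence, and the random untrained weights are all assumptions chosen only to keep the example small and runnable.

```python
import numpy as np

Q = 256       # quantization levels per audio sample (assumed)
FRAME = 4     # samples per frame seen by the recurrent tier (assumed)
HIDDEN = 32   # hidden width of both tiers (assumed, tiny for illustration)


def init_params(rng):
    """Random, untrained weights; a real model would learn these."""
    s = 0.1
    return {
        "W_xh": rng.normal(0, s, (FRAME, HIDDEN)),          # frame -> RNN hidden
        "W_hh": rng.normal(0, s, (HIDDEN, HIDDEN)),         # RNN recurrence
        "W_c": rng.normal(0, s, (HIDDEN, FRAME * HIDDEN)),  # RNN hidden -> per-sample conditioning
        "W_in": rng.normal(0, s, (FRAME, HIDDEN)),          # previous samples -> MLP hidden
        "W_out": rng.normal(0, s, (HIDDEN, Q)),             # MLP hidden -> logits over Q levels
    }


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def scale(samples):
    """Map integer levels in [0, Q-1] to floats in [-1, 1]."""
    return np.array(samples) / (Q - 1) * 2 - 1


def generate(n_samples, seed=0):
    rng = np.random.default_rng(seed)
    p = init_params(rng)
    audio = [Q // 2] * FRAME   # prime with one frame of "silence"
    h = np.zeros(HIDDEN)       # recurrent state carried across frames
    while len(audio) - FRAME < n_samples:
        # Stateful tier: update the hidden state from the last full frame.
        h = np.tanh(scale(audio[-FRAME:]) @ p["W_xh"] + h @ p["W_hh"])
        cond = (h @ p["W_c"]).reshape(FRAME, HIDDEN)
        # Memory-less tier: predict each sample of the next frame in turn.
        for k in range(FRAME):
            hid = np.maximum(0.0, scale(audio[-FRAME:]) @ p["W_in"] + cond[k])
            probs = softmax(hid @ p["W_out"])
            audio.append(int(rng.choice(Q, p=probs)))
    return np.array(audio[FRAME:FRAME + n_samples])


samples = generate(16)
```

With untrained weights the output is noise. The point is only the control flow: the recurrent state `h` persists across frames, while the MLP sees nothing beyond the last `FRAME` samples and the conditioning vector for its position.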

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	Blizzard Challenge 2013	NLL	1.387	SampleRNN (3-tier)
Speech Recognition	Blizzard Challenge 2013	NLL	1.392	SampleRNN (2-tier)
Speech Synthesis	Blizzard Challenge 2013	NLL	1.387	SampleRNN (3-tier)
Speech Synthesis	Blizzard Challenge 2013	NLL	1.392	SampleRNN (2-tier)
Accented Speech Recognition	Blizzard Challenge 2013	NLL	1.387	SampleRNN (3-tier)
Accented Speech Recognition	Blizzard Challenge 2013	NLL	1.392	SampleRNN (2-tier)
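
As a reading aid for the NLL column, the sketch below computes an average negative log-likelihood per audio sample in bits from a model's predicted distributions. The table does not state the unit, so treating the values as bits per sample is an assumption, and the function name `nll_bits_per_sample` is hypothetical.

```python
import numpy as np


def nll_bits_per_sample(probs, targets):
    """Average negative log-likelihood in bits.

    probs:   array of shape (T, Q), one predicted distribution per time step
    targets: array of shape (T,), the true quantized level at each step
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    # Pick the probability assigned to the true level at each step.
    p_true = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log2(p_true)))


# A uniform guess over 256 levels costs log2(256) = 8 bits per sample.
uniform = np.full((5, 256), 1 / 256)
print(nll_bits_per_sample(uniform, [0, 1, 2, 3, 4]))  # 8.0
```

Under that assumption, lower values mean the model assigns more probability to the samples that actually occur, with 8 bits as the uniform-guess baseline for 256 levels.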

Related Papers

NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing (2025-06-26)
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance (2025-06-26)
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation (2025-06-24)