Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, Yoshua Bengio

2016-12-22 | Audio Generation | Speech Synthesis | Temporal Sequences

Abstract

In this paper we propose a novel model for unconditional audio generation based on generating one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in the temporal sequences over very long time spans, on three datasets of different nature. Human evaluation of the generated samples indicates that our model is preferred over competing models. We also show how each component of the model contributes to the exhibited performance.
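
To make the abstract's description concrete, the following is a toy sketch of sample-at-a-time generation in which a slow, stateful recurrent tier updates once per frame and conditions a memory-less per-sample predictor. It is not the authors' implementation: the weights are random and untrained, and every name and constant here (`generate`, `Q_LEVELS`, `FRAME_SIZE`, `HIDDEN`) is an assumption chosen for illustration only.

```python
import math
import random

# All constants below are illustrative assumptions, not values from the paper.
Q_LEVELS = 16    # number of quantization bins for each audio sample
FRAME_SIZE = 4   # samples consumed by the slow tier per update
HIDDEN = 8       # size of the recurrent state


def make_matrix(rows, cols, rng):
    """Random (untrained) weight matrix as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scale(samples):
    """Map integer bins in [0, Q_LEVELS - 1] to floats in [-1, 1]."""
    return [s / (Q_LEVELS - 1) * 2 - 1 for s in samples]


def generate(num_frames, seed=0):
    rng = random.Random(seed)
    w_in = make_matrix(HIDDEN, FRAME_SIZE, rng)              # frame -> state
    w_rec = make_matrix(HIDDEN, HIDDEN, rng)                 # state -> state
    w_out = make_matrix(Q_LEVELS, HIDDEN + FRAME_SIZE, rng)  # per-sample readout

    state = [0.0] * HIDDEN
    samples = [Q_LEVELS // 2] * FRAME_SIZE  # mid-level values prime the model
    for _ in range(num_frames):
        # Slow, stateful tier: one recurrent update per frame.
        frame = scale(samples[-FRAME_SIZE:])
        pre = [a + b for a, b in zip(matvec(w_in, frame), matvec(w_rec, state))]
        state = [math.tanh(p) for p in pre]
        # Fast, memory-less tier: predict one sample at a time from the
        # frame-level state plus the most recent samples.
        for _ in range(FRAME_SIZE):
            context = scale(samples[-FRAME_SIZE:])
            probs = softmax(matvec(w_out, state + context))
            samples.append(rng.choices(range(Q_LEVELS), weights=probs)[0])
    return samples[FRAME_SIZE:]  # drop the priming samples
```

Calling `generate(5)` returns 20 quantized sample indices. Each one is drawn from a distribution that depends on all previously generated samples, through the recurrent state and the recent-sample context.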

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Recognition | Blizzard Challenge 2013 | NLL | 1.387 | SampleRNN (3-tier) |
| Speech Recognition | Blizzard Challenge 2013 | NLL | 1.392 | SampleRNN (2-tier) |
| Speech Synthesis | Blizzard Challenge 2013 | NLL | 1.387 | SampleRNN (3-tier) |
| Speech Synthesis | Blizzard Challenge 2013 | NLL | 1.392 | SampleRNN (2-tier) |
| Accented Speech Recognition | Blizzard Challenge 2013 | NLL | 1.387 | SampleRNN (3-tier) |
| Accented Speech Recognition | Blizzard Challenge 2013 | NLL | 1.392 | SampleRNN (2-tier) |
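
The NLL (negative log-likelihood) metric averages the negative log of the probability a model assigns to each true next sample, so lower is better. The sketch below shows that calculation under the assumption that the logarithm is base 2 (bits per sample); this page does not state the unit, and the function name `average_nll` is hypothetical.

```python
import math


def average_nll(probs_of_true_samples):
    """Mean negative log2-probability assigned to the observed samples.

    Assumes base-2 logs (bits per sample); the unit is an assumption here.
    """
    if not probs_of_true_samples:
        raise ValueError("need at least one probability")
    total = sum(math.log2(p) for p in probs_of_true_samples)
    return -total / len(probs_of_true_samples)


# A model that gave the true samples probabilities 0.5 and 0.25:
print(average_nll([0.5, 0.25]))  # → 1.5
```

Under this reading, a model that spreads probability uniformly over 8 bins would score exactly 3 bits per sample, so a lower value means the model concentrates more probability on the samples that actually occur.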

Related Papers

NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing (2025-06-26)
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance (2025-06-26)
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation (2025-06-24)