Deep Speech: Scaling up end-to-end speech recognition

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng

2014-12-17. Tasks: Speech Recognition, Accented Speech Recognition


Abstract

We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	swb_hub_500 WER fullSWBCH	Percentage error	16	CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB
Speech Recognition	Switchboard + Hub500	Percentage error	12.6	Deep Speech + FSH
Speech Recognition	Switchboard + Hub500	Percentage error	12.6	CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB
Speech Recognition	Switchboard + Hub500	Percentage error	20	Deep Speech
Speech Recognition	VoxForge European	Percentage error	31.2	Deep Speech
Speech Recognition	VoxForge American-Canadian	Percentage error	15.01	Deep Speech
Speech Recognition	VoxForge Indian	Percentage error	45.35	Deep Speech
Speech Recognition	VoxForge Commonwealth	Percentage error	28.46	Deep Speech
Speech Recognition	CHiME real	Percentage error	67.94	CNN + Bi-RNN + CTC (speech to letters)
Speech Recognition	CHiME clean	Percentage error	6.3	CNN + Bi-RNN + CTC (speech to letters)
Accented Speech Recognition	VoxForge European	Percentage error	31.2	Deep Speech
Accented Speech Recognition	VoxForge American-Canadian	Percentage error	15.01	Deep Speech
Accented Speech Recognition	VoxForge Indian	Percentage error	45.35	Deep Speech
Accented Speech Recognition	VoxForge Commonwealth	Percentage error	28.46	Deep Speech
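
The "Percentage error" values in the table are word error rates (WER). As a rough illustration of how such a figure is typically computed, the sketch below counts the word-level substitutions, deletions, and insertions needed to turn a hypothesis transcript into the reference, then divides by the reference length. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; the benchmark scoring tools behind the reported numbers apply their own text normalization, which is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance.

    Illustrative only: tokenization is a plain whitespace split, with no
    case folding or punctuation handling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.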
