Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Deep Speech: Scaling up end-to-end speech recognition

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng

2014-12-17 · Speech Recognition · Accented Speech Recognition
Paper · PDF · Code (1 official implementation, plus community implementations)

Abstract

We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
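The abstract's claim that no phoneme dictionary is needed rests on training the RNN with Connectionist Temporal Classification (CTC), which emits characters directly and sums over every frame-level alignment that collapses to the target string. A minimal pure-Python sketch of the CTC forward (alpha) recursion, assuming per-frame character distributions `probs` and blank index 0 (these names and the toy alphabet are illustrative, not from the paper's released code):

```python
def ctc_prob(probs, target, blank=0):
    """P(target | probs), summed over all CTC alignments.

    probs  : list of per-frame distributions over the alphabet (T x C)
    target : list of label indices (no blanks), e.g. character ids
    """
    # Extended label sequence: a blank interleaved around every label
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S, T = len(ext), len(probs)
    # alpha[s] = total probability of alignments ending at ext[s] at frame t
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]          # start with a blank ...
    if S > 1:
        alpha[1] = probs[0][ext[1]]     # ... or with the first label
    for t in range(1, T):
        prev = alpha[:]
        for s in range(S):
            a = prev[s]                                        # stay on same symbol
            if s >= 1:
                a += prev[s - 1]                               # advance one step
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += prev[s - 2]                               # skip the blank between distinct labels
            alpha[s] = a * probs[t][ext[s]]
    # Valid alignments end on the last label or the trailing blank
    return alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)

# Toy example: alphabet {0: blank, 1: 'a'}, two frames, uniform distributions.
# Frame sequences (a,a), (blank,a) and (a,blank) all collapse to "a".
print(ctc_prob([[0.5, 0.5], [0.5, 0.5]], [1]))  # 0.75 = 3 paths x 0.25 each
```

Because the recursion only ever combines three predecessors per state, the cost is O(T·U) for T frames and U target labels, which is what makes letter-level training tractable at the scale the paper describes.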

Results

Task | Dataset | Metric | Value | Model
Speech Recognition | swb_hub_500 WER fullSWBCH | Percentage error | 16 | CNN + Bi-RNN + CTC (speech to letters); 25.9% WER if trained only on SWB
Speech Recognition | Switchboard + Hub500 | Percentage error | 12.6 | Deep Speech + FSH
Speech Recognition | Switchboard + Hub500 | Percentage error | 12.6 | CNN + Bi-RNN + CTC (speech to letters); 25.9% WER if trained only on SWB
Speech Recognition | Switchboard + Hub500 | Percentage error | 20 | Deep Speech
Speech Recognition | VoxForge European | Percentage error | 31.2 | Deep Speech
Speech Recognition | VoxForge American-Canadian | Percentage error | 15.01 | Deep Speech
Speech Recognition | VoxForge Indian | Percentage error | 45.35 | Deep Speech
Speech Recognition | VoxForge Commonwealth | Percentage error | 28.46 | Deep Speech
Speech Recognition | CHiME real | Percentage error | 67.94 | CNN + Bi-RNN + CTC (speech to letters)
Speech Recognition | CHiME clean | Percentage error | 6.3 | CNN + Bi-RNN + CTC (speech to letters)
Accented Speech Recognition | VoxForge European | Percentage error | 31.2 | Deep Speech
Accented Speech Recognition | VoxForge American-Canadian | Percentage error | 15.01 | Deep Speech
Accented Speech Recognition | VoxForge Indian | Percentage error | 45.35 | Deep Speech
Accented Speech Recognition | VoxForge Commonwealth | Percentage error | 28.46 | Deep Speech
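The "Percentage error" values in the table are word error rates (WER): the word-level edit distance between hypothesis and reference transcripts, divided by the reference word count. A minimal sketch of the standard computation (function name and example strings are illustrative):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, one DP row at a time
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion: ref word r missing from hyp
                       d[j - 1] + 1,      # insertion: extra hyp word h
                       prev + (r != h))   # substitution, or free match
            prev = cur
    return d[len(hyp)] / len(ref)

# One inserted word against a 4-word reference -> 1/4 = 25% WER
print(wer("deep speech beats baselines", "deep speech beats the baselines"))  # 0.25
```

Under this metric, the table's "16" on the full Hub5'00 test set corresponds to the 16.0% WER quoted in the abstract.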

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)
AUTOMATIC PRONUNCIATION MISTAKE DETECTOR PROJECT REPORT (2025-06-25)