Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, Zhenyao Zhu

Published: 2015-12-08 · Tasks: Speech Recognition, Accented Speech Recognition
Links: Paper · PDF · Code (one official implementation plus community implementations)

Abstract

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
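The end-to-end models in this line of work emit per-frame character probabilities and are trained with CTC, so transcripts are recovered by best-path decoding: pick the most likely label per frame, collapse consecutive repeats, then drop blanks. As a minimal sketch of that decoding rule (the alphabet and blank index below are illustrative assumptions, not the paper's exact label set):

```python
def ctc_greedy_decode(frame_labels, blank=0, alphabet="abcdefghijklmnopqrstuvwxyz "):
    """CTC best-path decoding: collapse repeated labels, then remove blanks.

    frame_labels: per-frame argmax label IDs; 0 is the CTC blank (assumed),
    labels 1..len(alphabet) index into `alphabet`.
    """
    out = []
    prev = None
    for label in frame_labels:
        # Emit only on a change of label, and never emit the blank.
        if label != prev and label != blank:
            out.append(alphabet[label - 1])
        prev = label
    return "".join(out)


# Frames [blank, c, c, blank, a, t, t, blank] collapse to "cat"
# (c=3, a=1, t=20 under the assumed alphabet).
print(ctc_greedy_decode([0, 3, 3, 0, 1, 20, 20, 0]))
```

In the full system this greedy pass is replaced by a beam search over the CTC outputs combined with a language model, but the collapse-repeats-then-drop-blanks rule is the core of how a frame-level output sequence becomes a transcript.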

Results

Task | Dataset | Metric | Value | Model
Speech Recognition | WSJ eval92 | Word Error Rate (WER) | 3.6 | Deep Speech 2
Speech Recognition | WSJ eval93 | Word Error Rate (WER) | 4.98 | Deep Speech 2
Speech Recognition | LibriSpeech test-clean | Word Error Rate (WER) | 5.33 | Deep Speech 2
Speech Recognition | LibriSpeech test-other | Word Error Rate (WER) | 13.25 | Deep Speech 2
Speech Recognition | VoxForge European | Percentage error | 17.55 | Deep Speech 2
Speech Recognition | VoxForge American-Canadian | Percentage error | 7.55 | Deep Speech 2
Speech Recognition | VoxForge Indian | Percentage error | 22.44 | Deep Speech 2
Speech Recognition | VoxForge Commonwealth | Percentage error | 13.56 | Deep Speech 2
Speech Recognition | CHiME real | Percentage error | 21.79 | Deep Speech 2
Speech Recognition | CHiME clean | Percentage error | 3.34 | Deep Speech 2
Accented Speech Recognition | VoxForge European | Percentage error | 17.55 | Deep Speech 2
Accented Speech Recognition | VoxForge American-Canadian | Percentage error | 7.55 | Deep Speech 2
Accented Speech Recognition | VoxForge Indian | Percentage error | 22.44 | Deep Speech 2
Accented Speech Recognition | VoxForge Commonwealth | Percentage error | 13.56 | Deep Speech 2
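Word Error Rate, the metric reported above, is the word-level edit distance (substitutions, insertions, deletions) between the hypothesis and the reference transcript, divided by the number of reference words, and usually quoted as a percentage. A minimal sketch of the standard computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)


# One substitution in a three-word reference -> WER of 1/3 (about 33.3%).
print(wer("the cat sat", "the cat sit"))
```

The "Percentage error" rows use the same quantity expressed as a percentage; multiply the ratio by 100 to compare against the table.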

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)
AUTOMATIC PRONUNCIATION MISTAKE DETECTOR PROJECT REPORT (2025-06-25)