Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng
We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Recognition | swb_hub_500 WER fullSWBCH | Percentage error | 16 | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB |
| Speech Recognition | Switchboard + Hub500 | Percentage error | 12.6 | Deep Speech + FSH |
| Speech Recognition | Switchboard + Hub500 | Percentage error | 12.6 | CNN + Bi-RNN + CTC (speech to letters), 25.9% WER if trained only on SWB |
| Speech Recognition | Switchboard + Hub500 | Percentage error | 20 | Deep Speech |
| Speech Recognition | VoxForge European | Percentage error | 31.2 | Deep Speech |
| Speech Recognition | VoxForge American-Canadian | Percentage error | 15.01 | Deep Speech |
| Speech Recognition | VoxForge Indian | Percentage error | 45.35 | Deep Speech |
| Speech Recognition | VoxForge Commonwealth | Percentage error | 28.46 | Deep Speech |
| Speech Recognition | CHiME real | Percentage error | 67.94 | CNN + Bi-RNN + CTC (speech to letters) |
| Speech Recognition | CHiME clean | Percentage error | 6.3 | CNN + Bi-RNN + CTC (speech to letters) |
| Accented Speech Recognition | VoxForge European | Percentage error | 31.2 | Deep Speech |
| Accented Speech Recognition | VoxForge American-Canadian | Percentage error | 15.01 | Deep Speech |
| Accented Speech Recognition | VoxForge Indian | Percentage error | 45.35 | Deep Speech |
| Accented Speech Recognition | VoxForge Commonwealth | Percentage error | 28.46 | Deep Speech |
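
The "Percentage error" values in the table appear to be word error rates (WER), as the first row's label suggests. As a rough illustration of that metric, the sketch below computes WER as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example. Official benchmark scoring normally applies additional text normalization, so this will not reproduce the reported numbers exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: (substitutions + deletions + insertions) / reference words.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.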