Jasper: An End-to-End Convolutional Neural Acoustic Model

Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, Ravi Teja Gadde

2019-04-05

Speech Recognition, Language Modelling


Abstract

In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well as or better than more complex choices. Our deepest Jasper variant uses 54 convolutional layers. With this architecture, we achieve 2.95% WER using a beam-search decoder with an external neural language model and 3.86% WER with a greedy decoder on LibriSpeech test-clean. We also report competitive results on the Wall Street Journal and the Hub5'00 conversational evaluation datasets.
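
The greedy decoder mentioned in the abstract can be illustrated with a minimal sketch. It assumes the acoustic model is trained with CTC and emits one score vector per audio frame over the character set plus a blank symbol. The function name, the toy alphabet, and the scores below are hypothetical and are not taken from the authors' code.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank_index):
    """Greedy (best-path) CTC decoding, as an illustrative sketch.

    frame_scores: list of per-frame score lists, one score per output symbol.
    alphabet:     list mapping symbol index -> character (blank excluded).
    blank_index:  index of the CTC blank symbol (assumed here to be the last one).
    """
    decoded = []
    previous = None
    for scores in frame_scores:
        # Pick the highest-scoring symbol for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit only when the symbol changes and is not the blank.
        if best != previous and best != blank_index:
            decoded.append(alphabet[best])
        previous = best
    return "".join(decoded)


# Toy example: two characters plus a blank at index 2 (hypothetical values).
toy_alphabet = ["a", "b"]
toy_scores = [
    [0.90, 0.05, 0.05],  # a
    [0.80, 0.10, 0.10],  # a (repeat, collapsed)
    [0.10, 0.10, 0.80],  # blank (separates the two a's)
    [0.70, 0.20, 0.10],  # a
    [0.10, 0.80, 0.10],  # b
    [0.20, 0.70, 0.10],  # b (repeat, collapsed)
]
transcript = ctc_greedy_decode(toy_scores, toy_alphabet, blank_index=2)
print(transcript)  # aab
```

A beam-search decoder with an external language model keeps several candidate transcripts per frame instead of only the top one, which is consistent with the lower WER the abstract reports for that setting.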

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	WSJ eval92	Word Error Rate (WER)	6.9	Jasper 10x3
Speech Recognition	Hub5'00 (CallHome subset)	Word Error Rate (WER)	16.2	Jasper DR 10x5
Speech Recognition	Hub5'00 (SwitchBoard subset)	Word Error Rate (WER)	7.8	Jasper DR 10x5
Speech Recognition	LibriSpeech test-clean	Word Error Rate (WER)	2.84	Jasper DR 10x5 (+ Time/Freq Masks)
Speech Recognition	LibriSpeech test-clean	Word Error Rate (WER)	2.95	Jasper DR 10x5
Speech Recognition	LibriSpeech test-other	Word Error Rate (WER)	7.84	Jasper DR 10x5 (+ Time/Freq Masks)
Speech Recognition	LibriSpeech test-other	Word Error Rate (WER)	8.79	Jasper DR 10x5
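
The values in the table are word error rates in percent. As a rough illustration of how such a number is computed, the sketch below uses the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names and example sentences are hypothetical, and official scoring tools may apply text normalization that this sketch omits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using one rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance over reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.2f}")  # 16.67
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.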
