Espresso: A Fast End-to-end Neural Speech Recognition Toolkit

Yiming Wang, Tongfei Chen, Hainan Xu, Shuoyang Ding, Hang Lv, Yiwen Shao, Nanyun Peng, Lei Xie, Shinji Watanabe, Sanjeev Khudanpur

2019-09-18
Speech Recognition, Machine Translation, Automatic Speech Recognition (ASR), Data Augmentation, Translation, Language Modelling

Abstract

We present Espresso, an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented. Espresso achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4 to 11 times faster for decoding than similar systems (e.g. ESPnet).
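
The abstract mentions language model fusion at decoding time. As a rough illustration of the general idea only, the sketch below shows plain shallow fusion, a simpler technique than the look-ahead word-based fusion described above: each candidate token's ASR log-probability is combined with a weighted language model log-probability. The function names, the `lm_weight` parameter, and the toy probabilities are assumptions made for this example and are not taken from Espresso's code.

```python
import math


def shallow_fusion_scores(asr_log_probs, lm_log_probs, lm_weight=0.5):
    """Combine per-token ASR and LM log-probabilities (hypothetical helper).

    score(token) = log p_asr(token) + lm_weight * log p_lm(token)
    """
    if asr_log_probs.keys() != lm_log_probs.keys():
        raise ValueError("both models must score the same set of tokens")
    return {
        token: asr_log_probs[token] + lm_weight * lm_log_probs[token]
        for token in asr_log_probs
    }


def best_token(asr_log_probs, lm_log_probs, lm_weight=0.5):
    """Return the token with the highest fused score."""
    scores = shallow_fusion_scores(asr_log_probs, lm_log_probs, lm_weight)
    return max(scores, key=scores.get)


# Toy distributions (made up): the acoustic model slightly prefers "a",
# while the language model strongly prefers "b".
asr = {"a": math.log(0.6), "b": math.log(0.4)}
lm = {"a": math.log(0.1), "b": math.log(0.9)}

choice_without_lm = best_token(asr, lm, lm_weight=0.0)
choice_with_lm = best_token(asr, lm, lm_weight=1.0)
```

With the language model weight set to zero the choice follows the ASR scores alone; raising the weight lets the language model override a weak acoustic preference. A real decoder would apply this scoring inside beam search over whole hypotheses rather than to a single step.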

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	WSJ eval92	Word Error Rate (WER)	3.4	Espresso
Speech Recognition	Hub5'00 SwitchBoard	Word Error Rate (WER)	9.2	Espresso
Speech Recognition	Hub5'00 CallHome	Word Error Rate (WER)	19.1	Espresso
Speech Recognition	LibriSpeech test-clean	Word Error Rate (WER)	2.8	Espresso
Speech Recognition	LibriSpeech test-other	Word Error Rate (WER)	8.7	Espresso
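
Word error rate, the metric reported in the table above, is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference transcript, divided by the number of reference words and expressed as a percentage. The sketch below is a minimal single-utterance version under that standard definition; the function name is hypothetical, and it does not reproduce the text normalization or corpus-level scoring used to produce the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for one utterance (illustrative helper)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substituted word out of four reference words.
example_wer = word_error_rate("a b c d", "a x c d")
```

Benchmark figures are normally computed over an entire test set by summing the errors and the reference word counts across utterances before dividing, rather than by averaging per-utterance rates.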
