A Comparative Study on Transformer vs RNN in Speech Applications

Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang

2019-09-13
Speech Recognition, Machine Translation, Automatic Speech Recognition (ASR), Text to Speech, Translation


Abstract

Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNN) in a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task, including the surprising superiority of Transformer over RNN in 13 of the 15 ASR benchmarks. We are preparing to release Kaldi-style reproducible recipes using open source and publicly available datasets for all the ASR, ST, and TTS tasks so that the community can build on these results.

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	LibriSpeech test-clean	Word Error Rate (WER)	2.6	Transformer
Speech Recognition	LibriSpeech test-other	Word Error Rate (WER)	5.7	Transformer
Speech Recognition	AISHELL-1	Word Error Rate (WER)	6.7	CTC/Att
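
The Word Error Rate values above are percentages. As a rough illustration of how such a figure is typically obtained, the sketch below assumes the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are hypothetical, and this is not the scoring script used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, assuming whitespace-tokenized transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a reported WER of 2.6 corresponds to roughly 2.6 word errors per 100 reference words. Because insertions count as errors, the value can exceed 100 when the hypothesis is much longer than the reference.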
