Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

Dima Rekesh, Nithin Rao Koluguri, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, He Huang, Oleksii Hrinchuk, Krishna Puvvada, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg

2023-05-08

Speech Recognition, Automatic Speech Recognition, speech-recognition, Spoken Language Understanding, Translation

Abstract

Conformer-based models have become the dominant end-to-end architecture for speech processing tasks. To make the Conformer architecture more efficient in training and inference, we redesigned it with a novel downsampling schema. The proposed model, named Fast Conformer (FC), is 2.8x faster than the original Conformer, supports scaling to a billion parameters without any changes to the core architecture, and achieves state-of-the-art accuracy on Automatic Speech Recognition benchmarks. To enable transcription of long-form speech up to 11 hours, we replaced global attention with limited context attention post-training, while also improving accuracy through fine-tuning with the addition of a global token. Fast Conformer, when combined with a Transformer decoder, also outperforms the original Conformer in accuracy and in speed for Speech Translation and Spoken Language Understanding.
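The abstract does not spell out how limited context attention and the global token interact, so the following is only a minimal sketch of one common way to express the idea as an attention mask. The function names, the `window` parameter, and the choice to place the global token at index 0 are assumptions made for illustration, not details taken from the paper.

```python
def limited_context_mask(seq_len, window, num_global=1):
    """Build a boolean mask where mask[i][j] is True if position i may attend to j.

    Assumption for this sketch: indices 0..num_global-1 are global tokens that
    attend to, and are attended by, every position. The remaining seq_len
    positions are ordinary frames that see only neighbours within `window`
    steps on either side.
    """
    total = num_global + seq_len
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            touches_global = i < num_global or j < num_global
            within_window = abs(i - j) <= window
            mask[i][j] = touches_global or within_window
    return mask


def allowed_pairs(mask):
    """Count the (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    # With a fixed window, the number of permitted pairs grows roughly
    # linearly with sequence length instead of quadratically.
    for n in (100, 200, 400):
        print(n, allowed_pairs(limited_context_mask(n, window=4)))
```

With a fixed window, each frame attends to a bounded number of positions, so the permitted pairs grow in proportion to the sequence length, which is the sense in which such attention scales linearly. A practical implementation would avoid materializing the full mask for very long inputs.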

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	Tedlium	Word Error Rate (WER)	3.92	parakeet-rnnt-1.1b
Speech Recognition	SPGISpeech	Word Error Rate (WER)	3.11	parakeet-rnnt-1.1b
Speech Recognition	LibriSpeech test-clean	Word Error Rate (WER)	1.46	parakeet-rnnt-1.1b
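For readers unfamiliar with the metric in the table, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words; the values above appear to be percentages. The sketch below computes it for a single sentence pair. The function name is hypothetical, and it applies no text normalization, since the page does not say which normalization the reported numbers use.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Uses a standard Levenshtein edit distance over whitespace-separated words.
    No casing or punctuation normalization is applied in this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {100 * wer:.2f}%")
```

Corpus-level figures such as those in the table are normally obtained by summing errors and reference words over all utterances before dividing, rather than by averaging per-sentence rates.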
