Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan

2019-11-08

Tasks: Speech Recognition, Audio-Visual Speech Recognition, Visual Speech Recognition, Lipreading

Abstract

This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state of the art on the LRS3-TED set.

Results

Task	Dataset	Metric	Value	Model
Audio-Visual Speech Recognition	LRS3-TED	Word Error Rate (WER)	4.5	RNN-T
Lipreading	LRS3-TED	Word Error Rate (WER)	33.6	RNN-T
Natural Language Transduction	LRS3-TED	Word Error Rate (WER)	33.6	RNN-T
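
The values above are word error rates in percent. As a rough illustration of how such a figure is usually obtained, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; the paper's actual scoring pipeline and text normalization are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, using whitespace tokenization (an assumption)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

For instance, scoring the hypothesis "a x c d" against the reference "a b c d" counts one substitution over four reference words. Corpus-level WER is normally computed by summing errors and reference words over all utterances rather than averaging per-utterance rates.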
