Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

LipNet: End-to-End Sentence-level Lipreading

Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, Nando de Freitas

2016-11-05 · Lipreading · General Classification
Paper · PDF · Code (official)

Abstract

Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing models trained end-to-end perform only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy on the sentence-level, overlapped-speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy (Gergen et al., 2016).
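The connectionist temporal classification (CTC) loss mentioned in the abstract produces a per-frame distribution over labels plus a special blank symbol; at inference time, the simplest decoding is greedy best-path decoding, which collapses repeated labels and removes blanks. A minimal sketch (the label set and frame predictions below are illustrative, not taken from the paper):

```python
BLANK = "-"  # CTC blank symbol

def ctc_greedy_decode(frame_labels):
    """Greedy (best-path) CTC decoding:
    collapse consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev:          # collapse consecutive repeats
            if lab != BLANK:     # drop blank symbols
                out.append(lab)
        prev = lab
    return "".join(out)

# Hypothetical per-frame argmax labels for a short clip:
frames = ["-", "b", "b", "-", "i", "i", "n", "n", "-", "-"]
print(ctc_greedy_decode(frames))  # → bin
```

Note that the blank symbol is what lets CTC emit genuinely repeated characters: `["a", "a", "-", "a"]` decodes to `"aa"`, while `["a", "a", "a"]` collapses to `"a"`.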

Results

Task | Dataset | Metric | Value | Model
Lipreading | GRID corpus (mixed-speech) | Word Error Rate (WER) | 4.6 | LipNet
Natural Language Transduction | GRID corpus (mixed-speech) | Word Error Rate (WER) | 4.6 | LipNet
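The Word Error Rate (WER) reported above is the word-level Levenshtein distance between hypothesis and reference, divided by the number of reference words. A minimal sketch (the example sentences are made up, not GRID transcripts):

```python
def wer(reference, hypothesis):
    """Word Error Rate: edit distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# One substitution out of six reference words → WER ≈ 0.167
print(wer("place blue in m one now", "place blue at m one now"))
```

A WER of 4.6 in the table thus means 4.6 word errors per 100 reference words; unlike accuracy, WER can exceed 100% when the hypothesis contains many insertions.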

Related Papers

Learning Speaker-Invariant Visual Features for Lipreading (2025-06-09)
UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation (2025-06-04)
OXSeg: Multidimensional attention UNet-based lip segmentation using semi-supervised lip contours (2025-05-08)
Specialized text classification: an approach to classifying Open Banking transactions (2025-04-10)
Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models (2025-02-09)
Target Speaker Lipreading by Audio-Visual Self-Distillation Pretraining and Speaker Adaptation (2025-02-09)
Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions (2025-02-01)
Universal Training of Neural Networks to Achieve Bayes Optimal Classification Accuracy (2025-01-13)