Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Visual Keyword Spotting with Attention

K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman

2021-10-29 | Lip Reading | Visual Keyword Spotting

Abstract

In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
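The abstract does not spell out how "full cross-modal attention" is wired. One plausible reading, which is an assumption here and not confirmed by this page, is that the visual and phonetic token sequences are concatenated and every token attends to every other token. The sketch below illustrates that idea with a single attention head and identity projections (no learned weights). The names `joint_self_attention`, `video` and `phonemes` are hypothetical, and the toy vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(video_tokens, phoneme_tokens):
    """Single-head attention over the concatenated streams.

    Assumption: queries, keys and values are the raw tokens
    (identity projections), so this shows only the attention
    pattern, not a trained model.
    """
    tokens = video_tokens + phoneme_tokens  # one joint sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Scaled dot product of this token with every token in both streams.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Weighted average of all tokens, one value per feature dimension.
        outputs.append(
            [sum(row[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        )
    return outputs, weights


# Toy inputs: 3 video-frame features and 2 phoneme embeddings, 2-D each.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phonemes = [[0.5, 0.5], [1.0, 0.0]]
outputs, weights = joint_self_attention(video, phonemes)
```

Each row of `weights` spans both streams, so a video frame can draw on phoneme tokens and a phoneme token on video frames. A real model would add learned projections, multiple heads, positional encodings and an output head that predicts the keyword's temporal location.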

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Keyword Spotting | LRS3-TED | Top-1 Accuracy | 52 | Transpotter |
| Keyword Spotting | LRS3-TED | Top-5 Accuracy | 77.1 | Transpotter |
| Keyword Spotting | LRS3-TED | mAP | 55.4 | Transpotter |
| Keyword Spotting | LRS3-TED | mAP IOU@0.5 | 53.6 | Transpotter |
| Keyword Spotting | LRS2 | Top-1 Accuracy | 65 | Transpotter |
| Keyword Spotting | LRS2 | Top-5 Accuracy | 87.1 | Transpotter |
| Keyword Spotting | LRS2 | mAP | 69.2 | Transpotter |
| Keyword Spotting | LRS2 | mAP IOU@0.5 | 68.3 | Transpotter |
| Keyword Spotting | LRW | Top-1 Accuracy | 85.8 | Transpotter |
| Keyword Spotting | LRW | Top-5 Accuracy | 99.6 | Transpotter |
| Keyword Spotting | LRW | mAP | 64.1 | Transpotter |
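This page does not define the "mAP IOU@0.5" metric. A common convention, assumed here, is that a detection counts as correct only when the predicted time window overlaps the ground-truth window with an intersection-over-union (IoU) of at least 0.5. The sketch below shows that overlap check for 1-D temporal windows. The function names and the example frame indices are hypothetical.

```python
def temporal_iou(pred, truth):
    """IoU of two (start, end) windows on the time axis."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Overlap length, clamped to zero when the windows are disjoint.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_localised(pred, truth, threshold=0.5):
    """True when the prediction meets the assumed IoU threshold."""
    return temporal_iou(pred, truth) >= threshold


# Frames 10-20 predicted, frames 12-22 annotated: overlap 8, union 12.
close_hit = temporal_iou((10, 20), (12, 22))   # 8 / 12, about 0.667
# Disjoint windows give an IoU of 0.0 and fail the threshold.
miss = is_localised((0, 5), (10, 15))
```

Under this assumption, the gap between the mAP and mAP IOU@0.5 rows reflects detections where the keyword was found in the clip but the predicted window was too far off to count.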

Related Papers

VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer (2025-05-07)
Transforming faces into video stories -- VideoFace2.0 (2025-05-04)
Development and evaluation of a deep learning algorithm for German word recognition from lip movements (2025-04-22)
Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides (2025-04-21)
VALLR: Visual ASR Language Model for Lip Reading (2025-03-27)
Lend a Hand: Semi Training-Free Cued Speech Recognition via MLLM-Driven Hand Modeling for Barrier-free Communication (2025-03-11)
Integrating Persian Lip Reading in Surena-V Humanoid Robot for Human-Robot Interaction (2025-01-23)