K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman
In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip-reading methods on the challenging LRW, LRS2, and LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
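To make the two-stream idea concrete, below is a minimal NumPy sketch of scaled dot-product cross-attention, where phonetic-keyword tokens act as queries over the visual frame sequence. All dimensions, names, and the single-head formulation are illustrative assumptions for exposition; they are not the actual Transpotter implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    """Single-head cross-attention (illustrative sketch).

    queries:     (T_q, d)  -- e.g. phonetic keyword tokens
    keys_values: (T_kv, d) -- e.g. visual frame features
    Returns attended features of shape (T_q, d) and the
    (T_q, T_kv) attention map over video frames.
    """
    scores = queries @ keys_values.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ keys_values, weights

# Hypothetical sizes: 25 video frames, 6 phonemes, 16-dim features.
rng = np.random.default_rng(0)
visual = rng.normal(size=(25, 16))
phonetic = rng.normal(size=(6, 16))
attended, attn_map = cross_attention(phonetic, visual, d_k=16)
print(attended.shape, attn_map.shape)  # (6, 16) (6, 25)
```

In this reading, a peak in `attn_map` along the frame axis suggests where in the video a phoneme of the keyword is articulated, which is the intuition behind localizing the keyword temporally.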
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Keyword Spotting | LRS3-TED | Top-1 Accuracy | 52.0 | Transpotter |
| Keyword Spotting | LRS3-TED | Top-5 Accuracy | 77.1 | Transpotter |
| Keyword Spotting | LRS3-TED | mAP | 55.4 | Transpotter |
| Keyword Spotting | LRS3-TED | mAP @ IoU 0.5 | 53.6 | Transpotter |
| Keyword Spotting | LRS2 | Top-1 Accuracy | 65.0 | Transpotter |
| Keyword Spotting | LRS2 | Top-5 Accuracy | 87.1 | Transpotter |
| Keyword Spotting | LRS2 | mAP | 69.2 | Transpotter |
| Keyword Spotting | LRS2 | mAP @ IoU 0.5 | 68.3 | Transpotter |
| Keyword Spotting | LRW | Top-1 Accuracy | 85.8 | Transpotter |
| Keyword Spotting | LRW | Top-5 Accuracy | 99.6 | Transpotter |
| Keyword Spotting | LRW | mAP | 64.1 | Transpotter |