K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman
In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip-reading methods on the challenging LRW, LRS2, and LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
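To make the two-stream idea concrete, below is a minimal NumPy sketch of scaled dot-product cross-attention, where phonetic-keyword tokens act as queries over the visual frame sequence. All dimensions, names, and the single-head formulation are illustrative assumptions for exposition; they are not the actual Transpotter implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    """Single-head cross-attention (illustrative sketch).

    queries:     (T_q, d)  -- e.g. phonetic keyword tokens
    keys_values: (T_kv, d) -- e.g. visual frame features
    Returns attended features of shape (T_q, d) and the
    (T_q, T_kv) attention map over video frames.
    """
    scores = queries @ keys_values.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ keys_values, weights

# Hypothetical sizes: 25 video frames, 6 phonemes, 16-dim features.
rng = np.random.default_rng(0)
visual = rng.normal(size=(25, 16))
phonetic = rng.normal(size=(6, 16))
attended, attn_map = cross_attention(phonetic, visual, d_k=16)
print(attended.shape, attn_map.shape)  # (6, 16) (6, 25)
```

In this reading, a peak in `attn_map` along the frame axis suggests where in the video a phoneme of the keyword is articulated, which is the intuition behind localizing the keyword temporally.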
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Keyword Spotting | LRS3-TED | Top-1 Accuracy | 52.0 | Transpotter |
| Keyword Spotting | LRS3-TED | Top-5 Accuracy | 77.1 | Transpotter |
| Keyword Spotting | LRS3-TED | mAP | 55.4 | Transpotter |
| Keyword Spotting | LRS3-TED | mAP @ IoU 0.5 | 53.6 | Transpotter |
| Keyword Spotting | LRS2 | Top-1 Accuracy | 65.0 | Transpotter |
| Keyword Spotting | LRS2 | Top-5 Accuracy | 87.1 | Transpotter |
| Keyword Spotting | LRS2 | mAP | 69.2 | Transpotter |
| Keyword Spotting | LRS2 | mAP @ IoU 0.5 | 68.3 | Transpotter |
| Keyword Spotting | LRW | Top-1 Accuracy | 85.8 | Transpotter |
| Keyword Spotting | LRW | Top-5 Accuracy | 99.6 | Transpotter |
| Keyword Spotting | LRW | mAP | 64.1 | Transpotter |