Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention

Juan-Manuel Perez-Rua, Brais Martinez, Xiatian Zhu, Antoine Toisoul, Victor Escorcia, Tao Xiang

Published: 2020-04-02 · Task: Action Recognition
Links: Paper · PDF

Abstract

Attentive video modeling is essential for action recognition in unconstrained videos due to their rich yet redundant information over space and time. However, introducing attention in a deep neural network for action recognition is challenging for two reasons. First, an effective attention module needs to learn what (objects and their local motion patterns), where (spatially), and when (temporally) to focus on. Second, a video attention module must be efficient because existing action recognition models already suffer from high computational cost. To address both challenges, a novel What-Where-When (W3) video attention module is proposed. Departing from existing alternatives, our W3 module models all three facets of video attention jointly. Crucially, it is extremely efficient by factorizing the high-dimensional video feature data into low-dimensional meaningful spaces (a 1D channel vector for 'what' and 2D spatial tensors for 'where'), followed by lightweight temporal attention reasoning. Extensive experiments show that our attention model brings significant improvements to existing action recognition models, achieving new state-of-the-art performance on a number of benchmarks.
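The factorization the abstract describes — a 1D channel vector for 'what', a 2D spatial map for 'where', and lightweight temporal weights for 'when' — can be illustrated with a minimal, parameter-free NumPy sketch. This is not the authors' learned W3 module (which uses trained attention sub-networks); the pooling-plus-sigmoid scoring below is a simplifying assumption to show how the three low-dimensional attention factors combine over a feature tensor of shape (T, C, H, W):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def w3_style_attention(feats):
    """Illustrative factorized video attention over feats of shape (T, C, H, W).

    'What'  -> a 1D channel vector per frame, shape (C,)
    'Where' -> a 2D spatial map per frame, shape (H, W)
    'When'  -> scalar temporal weights over the T frames

    Hypothetical parameter-free sketch; the paper's module learns these
    attention maps with small sub-networks instead of fixed pooling.
    """
    T, C, H, W = feats.shape

    # 'What': channel attention from global average pooling over space.
    chan_att = sigmoid(feats.mean(axis=(2, 3)))        # (T, C)

    # 'Where': spatial attention from average pooling over channels.
    spat_att = sigmoid(feats.mean(axis=1))             # (T, H, W)

    # Apply the two factorized attentions via broadcasting.
    out = feats * chan_att[:, :, None, None] * spat_att[:, None, :, :]

    # 'When': lightweight temporal weights from per-frame energy (softmax over T).
    frame_score = out.mean(axis=(1, 2, 3))             # (T,)
    temp_att = np.exp(frame_score - frame_score.max())
    temp_att /= temp_att.sum()
    return out * temp_att[:, None, None, None]
```

The point of the factorization is cost: instead of attending over the full T x C x H x W volume, the module only ever scores a C-vector, an H x W map, and T scalars, which keeps the overhead small relative to the backbone.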

Results

Task                 | Dataset                | Metric         | Value | Model
---------------------|------------------------|----------------|-------|------------------------------------
Activity Recognition | Something-Something V1 | Top-1 Accuracy | 52.6  | TSM+W3 (16 frames, ResNet-50)
Activity Recognition | Something-Something V1 | Top-5 Accuracy | 81.3  | TSM+W3 (16 frames, ResNet-50)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 66.5  | TSM+W3 (16 frames, RGB ResNet-50)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 90.4  | TSM+W3 (16 frames, RGB ResNet-50)
Activity Recognition | EPIC-KITCHENS-55       | Top-1 Accuracy | 34.2  | TSM+W3 (full res)
Activity Recognition | EgoGesture             | Top-1 Accuracy | 94.3  | TSM+W3
Activity Recognition | EgoGesture             | Top-5 Accuracy | 99.2  | TSM+W3
Action Recognition   | Something-Something V1 | Top-1 Accuracy | 52.6  | TSM+W3 (16 frames, ResNet-50)
Action Recognition   | Something-Something V1 | Top-5 Accuracy | 81.3  | TSM+W3 (16 frames, ResNet-50)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 66.5  | TSM+W3 (16 frames, RGB ResNet-50)
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 90.4  | TSM+W3 (16 frames, RGB ResNet-50)
Action Recognition   | EPIC-KITCHENS-55       | Top-1 Accuracy | 34.2  | TSM+W3 (full res)
Action Recognition   | EgoGesture             | Top-1 Accuracy | 94.3  | TSM+W3
Action Recognition   | EgoGesture             | Top-5 Accuracy | 99.2  | TSM+W3

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)