Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Action Transformer Network

Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

2018-12-06 | CVPR 2019 6 | Recognizing And Localizing Human Actions | Action Recognition

Abstract

We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action, all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin using only raw RGB frames as input.
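
The aggregation step described in the abstract, where a person-specific query attends over features from the surrounding spatiotemporal context, can be illustrated with generic scaled dot-product attention. The sketch below is not the authors' implementation: the function name `person_attention`, the toy vectors, and the use of a single head with no learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def person_attention(query, keys, values):
    """Generic single-head scaled dot-product attention (illustrative only).

    query  : feature vector for one person (hypothetical; stands in for a
             person-specific query as mentioned in the abstract)
    keys   : one vector per spatiotemporal location in the clip
    values : one vector per spatiotemporal location in the clip
    Returns the attention-weighted context vector and the weights.
    """
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    context = [
        sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)
    ]
    return context, weights


# Toy example: three context locations with 2-dimensional features.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
context, weights = person_attention(query, keys, values)
print("weights:", weights)
print("context:", context)
```

In this toy setup the location whose key is most aligned with the query receives the largest weight, which is the mechanism that would let a model emphasize particular regions of the clip when classifying a given person's action.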

Results

Task                  Dataset   Metric      Value  Model
Activity Recognition  AVA v2.1  GFlops      39.6   I3D Tx HighRes
Activity Recognition  AVA v2.1  Params (M)  19.3   I3D Tx HighRes
Activity Recognition  AVA v2.1  mAP (Val)   27.6   I3D Tx HighRes
Activity Recognition  AVA v2.1  GFlops      6.5    I3D I3D
Activity Recognition  AVA v2.1  Params (M)  16.2   I3D I3D
Activity Recognition  AVA v2.1  mAP (Val)   23.4   I3D I3D
Action Recognition    AVA v2.1  GFlops      39.6   I3D Tx HighRes
Action Recognition    AVA v2.1  Params (M)  19.3   I3D Tx HighRes
Action Recognition    AVA v2.1  mAP (Val)   27.6   I3D Tx HighRes
Action Recognition    AVA v2.1  GFlops      6.5    I3D I3D
Action Recognition    AVA v2.1  Params (M)  16.2   I3D I3D
Action Recognition    AVA v2.1  mAP (Val)   23.4   I3D I3D
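
To put the two listed models side by side, the short script below computes the accuracy gap and the compute and parameter ratios from the figures in the results listing. The dictionary layout and variable names are assumptions for illustration; the numbers are copied from the listing as-is.

```python
# Figures copied from the AVA v2.1 results listing.
results = {
    "I3D Tx HighRes": {"gflops": 39.6, "params_m": 19.3, "map_val": 27.6},
    "I3D I3D": {"gflops": 6.5, "params_m": 16.2, "map_val": 23.4},
}

tx = results["I3D Tx HighRes"]
base = results["I3D I3D"]

# Absolute difference in validation mAP (percentage points).
map_gain = tx["map_val"] - base["map_val"]
# Relative cost of the higher-scoring model.
gflops_ratio = tx["gflops"] / base["gflops"]
params_ratio = tx["params_m"] / base["params_m"]

print(f"mAP gain:     {map_gain:.1f} points")
print(f"GFlops ratio: {gflops_ratio:.2f}x")
print(f"Params ratio: {params_ratio:.2f}x")
```

On these figures, I3D Tx HighRes is about 4.2 mAP points ahead of I3D I3D on the validation set, at roughly 6.1 times the GFlops and about 1.2 times the parameter count.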

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)
Active Multimodal Distillation for Few-shot Action Recognition (2025-06-16)