Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

ActionFormer: Localizing Moments of Actions with Transformers

Chenlin Zhang, Jianxin Wu, Yin Li

2022-02-16
Tasks: Action Localization, audio-visual event localization, Video Understanding, Action Recognition, Temporal Action Localization, object-detection

Abstract

Self-attention-based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer, a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements over prior work. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.6% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior work). Our code is available at http://github.com/happyharrycn/actionformer_release.
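The abstract reports mAP at a temporal IoU (tIoU) threshold and describes a decoder that scores every moment and estimates action boundaries. The sketch below illustrates both ideas in plain Python. The function names, the `stride` and `threshold` parameters, and the representation of boundaries as per-moment distances to the action's start and end are illustrative assumptions, not details taken from the abstract or the released code.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def decode_moments(
    scores: List[float],
    left: List[float],
    right: List[float],
    stride: float = 1.0,
    threshold: float = 0.5,
) -> List[Tuple[float, float, float]]:
    """Turn per-moment predictions into (start, end, score) candidates.

    Assumed (hypothetical) layout: moment t sits at time t * stride, and
    left[t] / right[t] are predicted distances in seconds from that moment
    to the action's start and end.
    """
    candidates = []
    for t, (score, d_left, d_right) in enumerate(zip(scores, left, right)):
        if score >= threshold:
            centre = t * stride
            candidates.append((centre - d_left, centre + d_right, score))
    return candidates


# One confident moment at t = 1 predicts an action from 0.0 s to 3.0 s.
predicted = decode_moments([0.1, 0.9, 0.2], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0])
# Against a ground-truth segment (1.0, 4.0) the overlap is 2 s and the
# union is 4 s, so this prediction just meets a tIoU threshold of 0.5.
overlap = temporal_iou(predicted[0][:2], (1.0, 4.0))
```

Under these assumptions no proposals or anchor windows are involved: every moment above the score threshold emits one candidate segment directly, and a candidate counts as correct at tIoU=0.5 only if its overlap with a ground-truth segment of the same class reaches 0.5.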

Results

Task | Dataset | Metric | Value | Model
Video | ActivityNet-1.3 | mAP | 36.6 | ActionFormer (TSP features)
Video | ActivityNet-1.3 | mAP IOU@0.5 | 54.7 | ActionFormer (TSP features)
Video | ActivityNet-1.3 | mAP IOU@0.75 | 37.8 | ActionFormer (TSP features)
Video | ActivityNet-1.3 | mAP IOU@0.95 | 8.4 | ActionFormer (TSP features)
Video | THUMOS’14 | Avg mAP (0.3:0.7) | 66.8 | ActionFormer (I3D features)
Video | THUMOS’14 | mAP IOU@0.3 | 82.1 | ActionFormer (I3D features)
Video | THUMOS’14 | mAP IOU@0.4 | 77.8 | ActionFormer (I3D features)
Video | THUMOS’14 | mAP IOU@0.5 | 71.0 | ActionFormer (I3D features)
Video | THUMOS’14 | mAP IOU@0.6 | 59.4 | ActionFormer (I3D features)
Video | THUMOS’14 | mAP IOU@0.7 | 43.9 | ActionFormer (I3D features)
Video | EPIC-KITCHENS-100 | Avg mAP (0.1-0.5) | 23.5 | ActionFormer (verb)
Video | EPIC-KITCHENS-100 | mAP IOU@0.1 | 26.6 | ActionFormer (verb)
Video | EPIC-KITCHENS-100 | mAP IOU@0.2 | 25.4 | ActionFormer (verb)
Video | EPIC-KITCHENS-100 | mAP IOU@0.3 | 24.2 | ActionFormer (verb)
Video | EPIC-KITCHENS-100 | mAP IOU@0.4 | 22.3 | ActionFormer (verb)
Video | EPIC-KITCHENS-100 | mAP IOU@0.5 | 19.1 | ActionFormer (verb)
Temporal Action Localization | ActivityNet-1.3 | mAP | 36.6 | ActionFormer (TSP features)
Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.5 | 54.7 | ActionFormer (TSP features)
Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.75 | 37.8 | ActionFormer (TSP features)
Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.95 | 8.4 | ActionFormer (TSP features)
Temporal Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 66.8 | ActionFormer (I3D features)
Temporal Action Localization | THUMOS’14 | mAP IOU@0.3 | 82.1 | ActionFormer (I3D features)
Temporal Action Localization | THUMOS’14 | mAP IOU@0.4 | 77.8 | ActionFormer (I3D features)
Temporal Action Localization | THUMOS’14 | mAP IOU@0.5 | 71.0 | ActionFormer (I3D features)
Temporal Action Localization | THUMOS’14 | mAP IOU@0.6 | 59.4 | ActionFormer (I3D features)
Temporal Action Localization | THUMOS’14 | mAP IOU@0.7 | 43.9 | ActionFormer (I3D features)
Temporal Action Localization | EPIC-KITCHENS-100 | Avg mAP (0.1-0.5) | 23.5 | ActionFormer (verb)
Temporal Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.1 | 26.6 | ActionFormer (verb)
Temporal Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.2 | 25.4 | ActionFormer (verb)
Temporal Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.3 | 24.2 | ActionFormer (verb)
Temporal Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.4 | 22.3 | ActionFormer (verb)
Temporal Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.5 | 19.1 | ActionFormer (verb)
Zero-Shot Learning | ActivityNet-1.3 | mAP | 36.6 | ActionFormer (TSP features)
Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.5 | 54.7 | ActionFormer (TSP features)
Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.75 | 37.8 | ActionFormer (TSP features)
Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.95 | 8.4 | ActionFormer (TSP features)
Zero-Shot Learning | THUMOS’14 | Avg mAP (0.3:0.7) | 66.8 | ActionFormer (I3D features)
Zero-Shot Learning | THUMOS’14 | mAP IOU@0.3 | 82.1 | ActionFormer (I3D features)
Zero-Shot Learning | THUMOS’14 | mAP IOU@0.4 | 77.8 | ActionFormer (I3D features)
Zero-Shot Learning | THUMOS’14 | mAP IOU@0.5 | 71.0 | ActionFormer (I3D features)
Zero-Shot Learning | THUMOS’14 | mAP IOU@0.6 | 59.4 | ActionFormer (I3D features)
Zero-Shot Learning | THUMOS’14 | mAP IOU@0.7 | 43.9 | ActionFormer (I3D features)
Zero-Shot Learning | EPIC-KITCHENS-100 | Avg mAP (0.1-0.5) | 23.5 | ActionFormer (verb)
Zero-Shot Learning | EPIC-KITCHENS-100 | mAP IOU@0.1 | 26.6 | ActionFormer (verb)
Zero-Shot Learning | EPIC-KITCHENS-100 | mAP IOU@0.2 | 25.4 | ActionFormer (verb)
Zero-Shot Learning | EPIC-KITCHENS-100 | mAP IOU@0.3 | 24.2 | ActionFormer (verb)
Zero-Shot Learning | EPIC-KITCHENS-100 | mAP IOU@0.4 | 22.3 | ActionFormer (verb)
Zero-Shot Learning | EPIC-KITCHENS-100 | mAP IOU@0.5 | 19.1 | ActionFormer (verb)
audio-visual event localization | UnAV-100 | mAP | 42.2 | ActionFormer
audio-visual event localization | UnAV-100 | AP@IOU0.5 | 43.5 | ActionFormer
Action Localization | ActivityNet-1.3 | mAP | 36.6 | ActionFormer (TSP features)
Action Localization | ActivityNet-1.3 | mAP IOU@0.5 | 54.7 | ActionFormer (TSP features)
Action Localization | ActivityNet-1.3 | mAP IOU@0.75 | 37.8 | ActionFormer (TSP features)
Action Localization | ActivityNet-1.3 | mAP IOU@0.95 | 8.4 | ActionFormer (TSP features)
Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 66.8 | ActionFormer (I3D features)
Action Localization | THUMOS’14 | mAP IOU@0.3 | 82.1 | ActionFormer (I3D features)
Action Localization | THUMOS’14 | mAP IOU@0.4 | 77.8 | ActionFormer (I3D features)
Action Localization | THUMOS’14 | mAP IOU@0.5 | 71.0 | ActionFormer (I3D features)
Action Localization | THUMOS’14 | mAP IOU@0.6 | 59.4 | ActionFormer (I3D features)
Action Localization | THUMOS’14 | mAP IOU@0.7 | 43.9 | ActionFormer (I3D features)
Action Localization | EPIC-KITCHENS-100 | Avg mAP (0.1-0.5) | 23.5 | ActionFormer (verb)
Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.1 | 26.6 | ActionFormer (verb)
Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.2 | 25.4 | ActionFormer (verb)
Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.3 | 24.2 | ActionFormer (verb)
Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.4 | 22.3 | ActionFormer (verb)
Action Localization | EPIC-KITCHENS-100 | mAP IOU@0.5 | 19.1 | ActionFormer (verb)
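
The "Avg mAP" rows for THUMOS’14 and EPIC-KITCHENS-100 can be checked against the per-threshold rows in the results above. The short sketch below assumes the average is an unweighted mean over the listed tIoU thresholds; the helper name `average_map` is hypothetical, and the numbers are copied from the results.

```python
def average_map(per_threshold: dict) -> float:
    """Unweighted mean of mAP values keyed by tIoU threshold."""
    return sum(per_threshold.values()) / len(per_threshold)


# Per-threshold mAP (%) as listed in the results.
thumos14_i3d = {0.3: 82.1, 0.4: 77.8, 0.5: 71.0, 0.6: 59.4, 0.7: 43.9}
epic_verb = {0.1: 26.6, 0.2: 25.4, 0.3: 24.2, 0.4: 22.3, 0.5: 19.1}

thumos14_avg = round(average_map(thumos14_i3d), 1)  # 66.8, matches the Avg mAP (0.3:0.7) row
epic_verb_avg = round(average_map(epic_verb), 1)    # 23.5, matches the Avg mAP (0.1-0.5) row
```

The same check does not work for ActivityNet-1.3: only three thresholds (0.5, 0.75, 0.95) are listed, and their mean does not equal the reported 36.6, which suggests that figure averages over a denser set of thresholds than the results show.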
