Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


End-to-end Temporal Action Detection with Transformer

Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, Song Bai, Xiang Bai

Published: 2021-06-18 · Tasks: Action Detection, Video Understanding, Temporal Action Localization
Links: Paper · PDF · Code (official)

Abstract

Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors, while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
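The core idea described above, temporal deformable attention, has each action query attend to only a sparse set of sampled snippets around a reference point rather than to every snippet in the video. A minimal single-head NumPy sketch of that sampling pattern is shown below; it is illustrative only, not the official implementation (which is multi-head and lives in the linked repository), and all names here (`temporal_deformable_attention`, `w_off`, `w_att`) are invented for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_deformable_attention(query, memory, ref_points, w_off, w_att, n_points=4):
    """Single-head sketch of temporal deformable attention.

    query:      (Q, C) action-query embeddings
    memory:     (T, C) per-snippet video features
    ref_points: (Q,)   reference locations in [0, 1] (normalized time)
    w_off:      (C, n_points) projection producing sampling offsets
    w_att:      (C, n_points) projection producing attention logits

    Each query samples `n_points` temporal locations (reference point plus
    learned offsets), reads features there by linear interpolation, and
    returns their attention-weighted sum.
    """
    T = memory.shape[0]
    offsets = query @ w_off                          # (Q, P) sampling offsets
    weights = softmax(query @ w_att)                 # (Q, P) attention weights
    # Sampling locations in normalized time, mapped to fractional indices.
    loc = np.clip(ref_points[:, None] + offsets / T, 0.0, 1.0) * (T - 1)
    lo = np.floor(loc).astype(int)                   # lower snippet index
    hi = np.minimum(lo + 1, T - 1)                   # upper snippet index
    frac = loc - lo                                  # interpolation fraction
    # Linearly interpolate features at each sampled location: (Q, P, C).
    sampled = memory[lo] * (1 - frac[..., None]) + memory[hi] * frac[..., None]
    return (weights[..., None] * sampled).sum(axis=1)  # (Q, C)

# Toy usage: 5 queries over a 50-snippet video with 16-dim features;
# random matrices stand in for the learned projections.
rng = np.random.default_rng(0)
out = temporal_deformable_attention(
    query=rng.normal(size=(5, 16)),
    memory=rng.normal(size=(50, 16)),
    ref_points=rng.uniform(size=5),
    w_off=rng.normal(size=(16, 4)),
    w_att=rng.normal(size=(16, 4)),
)
```

Because each query touches only `n_points` snippets instead of all `T`, the cost per query is constant in video length, which is what lets the paper claim a lower computation cost than dense-attention detectors.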

Results

Task | Dataset | Metric | Value | Model
Temporal Action Localization | HACS | Average-mAP | 32.09 | TadTR (I3D RGB)
Temporal Action Localization | HACS | mAP@0.5 | 47.14 | TadTR (I3D RGB)
Temporal Action Localization | HACS | mAP@0.75 | 32.11 | TadTR (I3D RGB)
Temporal Action Localization | HACS | mAP@0.95 | 10.94 | TadTR (I3D RGB)
Temporal Action Localization | ActivityNet-1.3 | mAP | 36.75 | TadTR (TSP features)
Temporal Action Localization | ActivityNet-1.3 | mAP IoU@0.5 | 53.62 | TadTR (TSP features)
Temporal Action Localization | ActivityNet-1.3 | mAP IoU@0.75 | 37.52 | TadTR (TSP features)
Temporal Action Localization | ActivityNet-1.3 | mAP IoU@0.95 | 10.56 | TadTR (TSP features)
Temporal Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 56.7 | TadTR
Temporal Action Localization | THUMOS’14 | mAP IoU@0.3 | 74.8 | TadTR
Temporal Action Localization | THUMOS’14 | mAP IoU@0.4 | 69.1 | TadTR
Temporal Action Localization | THUMOS’14 | mAP IoU@0.5 | 60.1 | TadTR
Temporal Action Localization | THUMOS’14 | mAP IoU@0.6 | 46.6 | TadTR
Temporal Action Localization | THUMOS’14 | mAP IoU@0.7 | 32.8 | TadTR
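To read the table: each "mAP IoU@t" entry counts a predicted segment as correct when its temporal IoU with a ground-truth segment is at least t, and the THUMOS’14 "Avg mAP (0.3:0.7)" is the mean of the per-threshold values. A minimal sketch (function name is illustrative):

```python
# Temporal IoU between two segments (start, end) in seconds:
# overlap length divided by union length.
def tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# THUMOS'14 per-threshold mAP values from the table above; the reported
# "Avg mAP (0.3:0.7)" is their arithmetic mean.
thumos_map = {0.3: 74.8, 0.4: 69.1, 0.5: 60.1, 0.6: 46.6, 0.7: 32.8}
avg_map = sum(thumos_map.values()) / len(thumos_map)
print(round(avg_map, 1))  # 56.7, matching the table
```

HACS and ActivityNet-1.3 instead average mAP over IoU thresholds 0.5:0.05:0.95, which is why their headline numbers are reported as "Average-mAP" and plain "mAP" respectively.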

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)