
An Empirical Study of End-to-End Temporal Action Detection

Xiaolong Liu, Song Bai, Xiang Bai

2022-04-06 · CVPR 2022
Tasks: Action Detection · Action Classification · Video Understanding · Temporal Action Localization
Links: Paper · PDF · Code (official)

Abstract

Temporal action detection (TAD) is an important yet challenging task in video understanding. It aims to simultaneously predict the semantic label and the temporal interval of every action instance in an untrimmed video. Rather than learning end-to-end, most existing methods adopt a head-only learning paradigm, where the video encoder is pre-trained for action classification and only the detection head on top of the encoder is optimized for TAD. The effect of end-to-end learning has not been systematically evaluated, and the efficiency-accuracy trade-off in end-to-end TAD lacks an in-depth study. In this paper, we present an empirical study of end-to-end temporal action detection. We validate the advantage of end-to-end learning over head-only learning and observe up to 11% performance improvement. We also study the effects of multiple design choices that affect TAD performance and speed, including the detection head, the video encoder, and the resolution of input videos. Based on these findings, we build a mid-resolution baseline detector, which achieves state-of-the-art performance among end-to-end methods while running more than 4× faster. We hope that this paper can serve as a guide for end-to-end learning and inspire future research in this field. Code and models are available at https://github.com/xlliu7/E2E-TAD.
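To make the head-only vs. end-to-end distinction concrete, here is a minimal PyTorch sketch. The `VideoEncoder` and `DetectionHead` classes are toy stand-ins (not the paper's SlowFast or TadTR modules), and the cross-entropy loss is a simplification of a real TAD objective; only the gradient-flow difference between the two paradigms is faithful to the setup the paper studies.

```python
# Head-only vs. end-to-end learning for TAD: a minimal sketch with toy modules.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Toy stand-in for a pretrained clip encoder such as SlowFast R50."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(3 * 96 * 96, dim)  # flatten each 96x96 RGB frame

    def forward(self, clips):  # clips: (B, T, 3, 96, 96) -> (B, T, dim)
        b, t = clips.shape[:2]
        return self.proj(clips.reshape(b, t, -1))

class DetectionHead(nn.Module):
    """Toy stand-in for a TAD head (e.g. a TadTR-style head)."""
    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, feats):  # (B, T, dim) -> per-snippet class logits
        return self.cls(feats)

encoder, head = VideoEncoder(), DetectionHead()

def train_step(clips, labels, end_to_end: bool):
    if end_to_end:
        feats = encoder(clips)       # gradients flow back into the encoder
    else:
        with torch.no_grad():        # head-only: encoder frozen; in practice
            feats = encoder(clips)   # features are often pre-extracted offline
    logits = head(feats)
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())
    loss.backward()                  # only end-to-end populates encoder grads
    return loss

# Example: two clips of 8 snippets each, 20 action classes.
loss = train_step(torch.randn(2, 8, 3, 96, 96),
                  torch.randint(0, 20, (2, 8)), end_to_end=True)
```

In the head-only path, `torch.no_grad()` stops gradients at the encoder, so `loss.backward()` updates only the head; this is what lets existing methods train on cheap pre-extracted features, at the cost of the accuracy gap the paper quantifies.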

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Temporal Action Localization | ActivityNet-1.3 | mAP | 35.1 | E2E-TAD (SlowFast R50+TadTR) |
| Temporal Action Localization | ActivityNet-1.3 | mAP IoU@0.5 | 50.47 | E2E-TAD (SlowFast R50+TadTR) |
| Temporal Action Localization | ActivityNet-1.3 | mAP IoU@0.75 | 35.99 | E2E-TAD (SlowFast R50+TadTR) |
| Temporal Action Localization | ActivityNet-1.3 | mAP IoU@0.95 | 10.83 | E2E-TAD (SlowFast R50+TadTR) |
| Temporal Action Localization | THUMOS'14 | Avg mAP (0.3:0.7) | 54.2 | E2E-TAD (SlowFast R50+TadTR) |
| Temporal Action Localization | THUMOS'14 | mAP IoU@0.3 | 69.4 | E2E-TAD (SlowFast R50+TadTR) |
| Temporal Action Localization | THUMOS'14 | mAP IoU@0.4 | 64.3 | E2E-TAD (SlowFast R50+TadTR) |
| Temporal Action Localization | THUMOS'14 | mAP IoU@0.5 | 56.0 | E2E-TAD (SlowFast R50+TadTR) |
| Temporal Action Localization | THUMOS'14 | mAP IoU@0.6 | 46.4 | E2E-TAD (SlowFast R50+TadTR) |
| Temporal Action Localization | THUMOS'14 | mAP IoU@0.7 | 34.9 | E2E-TAD (SlowFast R50+TadTR) |
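For readers new to these metrics, the snippet below shows the standard definitions behind the table: temporal IoU between predicted and ground-truth segments, and the averaging that produces the Avg mAP (0.3:0.7) row from the per-threshold values. This is a generic illustration, not the official ActivityNet/THUMOS evaluation code.

```python
# Temporal IoU and the Avg mAP (0.3:0.7) figure from the table above.

def temporal_iou(a, b):
    """IoU of two 1-D temporal segments given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5: 4s overlap / 8s union

# Per-threshold mAP on THUMOS'14 from the table (IoU 0.3 to 0.7, step 0.1):
map_per_iou = [69.4, 64.3, 56.0, 46.4, 34.9]
print(round(sum(map_per_iou) / len(map_per_iou), 1))  # 54.2 -> Avg mAP row
```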

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)