
Action Sensitivity Learning for Temporal Action Localization

Jiayi Shao, Xiaohan Wang, Ruijie Quan, Junjun Zheng, Jiang Yang, Yi Yang

Published: 2023-05-25 · ICCV 2023
Tasks: Action Localization · Moment Queries · Video Understanding · Temporal Action Localization

Abstract

Temporal action localization (TAL), which involves recognizing and locating action instances, is a challenging task in video understanding. Most existing approaches directly predict action classes and regress boundary offsets while overlooking the unequal importance of individual frames. In this paper, we propose an Action Sensitivity Learning framework (ASL) that assesses the value of each frame and then leverages the resulting action sensitivity to recalibrate the training procedure. We first introduce a lightweight Action Sensitivity Evaluator that learns action sensitivity at the class level and the instance level; the outputs of its two branches are combined to reweight the gradients of the two sub-tasks. Moreover, based on the action sensitivity of each frame, we design an Action Sensitive Contrastive Loss to enhance features, in which action-aware frames are sampled as positive pairs and action-irrelevant frames are pushed away. Extensive experiments on various action localization benchmarks (MultiTHUMOS, Charades, Ego4D-Moment Queries v1.0, EPIC-Kitchens 100, THUMOS14, and ActivityNet 1.3) show that ASL surpasses the state of the art in average mAP across multiple types of scenarios, e.g., single-labeled, densely-labeled, and egocentric.
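
The abstract describes two trainable pieces: a two-branch evaluator whose per-frame sensitivity reweights the gradients of the classification and regression sub-tasks, and a contrastive loss that pulls action-aware frames together while pushing action-irrelevant frames away. The sketch below illustrates both ideas under stated assumptions; it is not the authors' implementation, and the branch design, the sigmoid fusion, the sampling threshold thresh, and the temperature tau are illustrative choices.

```python
# Minimal sketch of ASL's two ingredients as described in the abstract.
# NOT the authors' code: the evaluator design, sigmoid fusion, threshold,
# and temperature below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionSensitivityEvaluator(nn.Module):
    """Lightweight two-branch head: a class-level sensitivity (one learnable
    scalar per action class) plus an instance-level sensitivity predicted
    from each frame's features."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.class_sens = nn.Embedding(num_classes, 1)  # class-level branch
        self.inst_sens = nn.Linear(dim, 1)              # instance-level branch

    def forward(self, frame_feats: torch.Tensor, class_ids: torch.Tensor):
        # frame_feats: (T, D) frame features; class_ids: (T,) action labels
        c = self.class_sens(class_ids).squeeze(-1)      # (T,)
        i = self.inst_sens(frame_feats).squeeze(-1)     # (T,)
        return torch.sigmoid(c + i)                     # per-frame weight in (0, 1)

def reweighted_task_loss(sens, cls_loss, reg_loss):
    # Recalibrate training: frames deemed more action-sensitive contribute
    # larger gradients to both the classification and regression sub-tasks.
    return (sens * cls_loss).mean() + (sens * reg_loss).mean()

def action_sensitive_contrastive_loss(frame_feats, sens, thresh=0.5, tau=0.1):
    """InfoNCE-style loss: high-sensitivity (action-aware) frames form
    positive pairs; low-sensitivity frames are the negatives pushed away."""
    z = F.normalize(frame_feats, dim=-1)
    pos, neg = z[sens > thresh], z[sens <= thresh]
    if len(pos) < 2 or len(neg) == 0:                   # nothing to contrast
        return z.new_zeros(())
    anchor, positive = pos[:-1], pos[1:]                # adjacent positive pairs
    pos_sim = (anchor * positive).sum(-1, keepdim=True) / tau  # (P-1, 1)
    neg_sim = anchor @ neg.T / tau                      # (P-1, N)
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    target = torch.zeros(len(anchor), dtype=torch.long)  # positive is index 0
    return F.cross_entropy(logits, target)
```

In this reading, the sensitivity weights enter the loss rather than the predictions, so frames the evaluator considers uninformative still receive gradients, just smaller ones.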

Results

Task | Dataset | Metric | Value | Model
Temporal Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 67.9 | ASL (I3D features)
Temporal Action Localization | THUMOS’14 | mAP IOU@0.3 | 83.1 | ASL (I3D features)
Temporal Action Localization | THUMOS’14 | mAP IOU@0.4 | 79.0 | ASL (I3D features)
Temporal Action Localization | THUMOS’14 | mAP IOU@0.5 | 71.7 | ASL (I3D features)
Temporal Action Localization | THUMOS’14 | mAP IOU@0.6 | 59.7 | ASL (I3D features)
Temporal Action Localization | THUMOS’14 | mAP IOU@0.7 | 45.8 | ASL (I3D features)
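
As a sanity check, Avg mAP (0.3:0.7) is the mean of the per-threshold mAPs at IoU 0.3 through 0.7 in steps of 0.1, and the table's values are self-consistent:

```python
# Average mAP over IoU thresholds 0.3, 0.4, 0.5, 0.6, 0.7 (values from the table).
maps = [83.1, 79.0, 71.7, 59.7, 45.8]
print(sum(maps) / len(maps))  # 67.86, reported as 67.9
```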

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)