Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning

Jia-Run Du, Jia-Chang Feng, Kun-Yu Lin, Fa-Ting Hong, Xiao-Ming Wu, Zhongang Qi, Ying Shan, Wei-Shi Zheng

2022-06-22
Tasks: Weakly Supervised Action Localization, Representation Learning, Action Localization, Multiple Instance Learning, Weakly-supervised Temporal Action Localization, Temporal Action Localization

Abstract

Weakly Supervised Temporal Action Localization (WSTAL) aims to localize and classify action instances in long untrimmed videos with only video-level category labels. Because no snippet-level supervision indicates action boundaries, previous methods typically assign pseudo labels to unlabeled snippets. However, since some action instances of different categories are visually similar, it is non-trivial to assign a snippet exactly the (usually single) action category it belongs to, and incorrect pseudo labels impair localization performance. To address this problem, we propose a novel method from a category exclusion perspective, named Progressive Complementary Learning (ProCL), which gradually enhances the snippet-level supervision. Our method is inspired by the fact that video-level labels precisely indicate the categories that all snippets surely do not belong to, which previous works ignore. Accordingly, we first exclude these surely non-existent categories with a complementary learning loss. We then introduce background-aware pseudo complementary labeling to exclude more categories for less ambiguous snippets. Furthermore, for the remaining ambiguous snippets, we attempt to reduce the ambiguity by distinguishing foreground actions from the background. Extensive experimental results show that our method achieves new state-of-the-art performance on two popular benchmarks, namely THUMOS14 and ActivityNet1.3.
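
The first step in the abstract, excluding categories that the video-level label rules out, can be illustrated with a generic complementary-label loss. The sketch below is not the paper's formulation: the function name, the averaging scheme, and the eps clamp are assumptions made for illustration. It only shows the idea that every snippet is penalized for putting probability on a category absent from the video.

```python
import math


def complementary_loss(snippet_probs, video_labels, eps=1e-8):
    """Generic complementary-label loss (illustrative, not ProCL's exact loss).

    snippet_probs: list of per-snippet lists of class probabilities in [0, 1].
    video_labels:  list of 0/1 flags; 1 means the category occurs in the video.
    Returns the mean of -log(1 - p) over all snippets and all absent categories.
    """
    # Categories the video-level label says no snippet can belong to.
    absent = [c for c, y in enumerate(video_labels) if y == 0]
    if not absent or not snippet_probs:
        return 0.0

    total = 0.0
    for probs in snippet_probs:
        for c in absent:
            # Clamp to eps so a probability of exactly 1.0 does not give log(0).
            total += -math.log(max(1.0 - probs[c], eps))
    return total / (len(snippet_probs) * len(absent))


# One snippet, three categories, only category 0 present in the video.
low = complementary_loss([[0.9, 0.1, 0.0]], [1, 0, 0])
high = complementary_loss([[0.1, 0.9, 0.0]], [1, 0, 0])
print(low < high)  # mass on an absent category costs more
```

The loss never tells a snippet which category it does belong to, only which ones it cannot belong to, so it stays valid even when visually similar actions would make a positive pseudo label unreliable.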

Results

Task                                   Dataset          Metric              Value   Model
Video                                  THUMOS’14        avg-mAP (0.1:0.7)   47.7    ProCL
Video                                  THUMOS’14        avg-mAP (0.1-0.5)   58.2    ProCL
Video                                  THUMOS’14        mAP@0.5             40.5    ProCL
Video                                  ActivityNet-1.3  mAP@0.5:0.95        26.1    ProCL
Temporal Action Localization           THUMOS’14        avg-mAP (0.1:0.7)   47.7    ProCL
Temporal Action Localization           THUMOS’14        avg-mAP (0.1-0.5)   58.2    ProCL
Temporal Action Localization           THUMOS’14        mAP@0.5             40.5    ProCL
Temporal Action Localization           ActivityNet-1.3  mAP@0.5:0.95        26.1    ProCL
Zero-Shot Learning                     THUMOS’14        avg-mAP (0.1:0.7)   47.7    ProCL
Zero-Shot Learning                     THUMOS’14        avg-mAP (0.1-0.5)   58.2    ProCL
Zero-Shot Learning                     THUMOS’14        mAP@0.5             40.5    ProCL
Zero-Shot Learning                     ActivityNet-1.3  mAP@0.5:0.95        26.1    ProCL
Action Localization                    THUMOS’14        avg-mAP (0.1:0.7)   47.7    ProCL
Action Localization                    THUMOS’14        avg-mAP (0.1-0.5)   58.2    ProCL
Action Localization                    THUMOS’14        mAP@0.5             40.5    ProCL
Action Localization                    ActivityNet-1.3  mAP@0.5:0.95        26.1    ProCL
Weakly Supervised Action Localization  THUMOS’14        avg-mAP (0.1:0.7)   47.7    ProCL
Weakly Supervised Action Localization  THUMOS’14        avg-mAP (0.1-0.5)   58.2    ProCL
Weakly Supervised Action Localization  THUMOS’14        mAP@0.5             40.5    ProCL
Weakly Supervised Action Localization  ActivityNet-1.3  mAP@0.5:0.95        26.1    ProCL
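
The metrics above are mean average precision at temporal IoU (tIoU) thresholds, either at a single threshold (mAP@0.5) or averaged over a range. The sketch below shows how tIoU and the threshold grids are commonly computed. The step sizes (0.1 for the THUMOS’14 ranges, 0.05 for 0.5:0.95 on ActivityNet-1.3) follow the usual benchmark convention and are an assumption here, since this page does not state them; all function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """tIoU between two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def threshold_grid(start, stop, step):
    """Inclusive list of tIoU thresholds, rounded to avoid float drift."""
    n = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(n)]


def average_map(map_per_threshold):
    """Average of per-threshold mAP values (dict: threshold -> mAP)."""
    return sum(map_per_threshold.values()) / len(map_per_threshold)


# Assumed conventional grids; not stated on this page.
thumos_grid = threshold_grid(0.1, 0.7, 0.1)         # 7 thresholds
activitynet_grid = threshold_grid(0.5, 0.95, 0.05)  # 10 thresholds

# A prediction covering 0-10 s against a ground truth at 5-15 s overlaps
# for 5 s out of a 15 s union, so it is a match at tIoU 0.3 but not at 0.5.
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

A detection counts as correct at a given threshold only if its tIoU with an unmatched ground-truth instance of the same class reaches that threshold, which is why the averaged scores over stricter ranges are lower than those over looser ones.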

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction (2025-07-15)