Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization

Huan Ren, Wenfei Yang, Tianzhu Zhang, Yongdong Zhang

Published: 2023-05-29 · CVPR 2023
Tasks: Weakly Supervised Action Localization · Action Localization · Multiple Instance Learning · Weakly-Supervised Temporal Action Localization · Temporal Action Localization
Links: Paper · PDF · Code (official)

Abstract

Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training. Without instance-level annotations, most existing methods follow the Segment-based Multiple Instance Learning (S-MIL) framework, where the predictions of segments are supervised by the labels of videos. However, the objective for acquiring segment-level scores during training is not consistent with the target for acquiring proposal-level scores during testing, leading to suboptimal results. To deal with this problem, we propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages, which includes three key designs: 1) a surrounding contrastive feature extraction module to suppress the discriminative short proposals by considering the surrounding contrastive information, 2) a proposal completeness evaluation module to inhibit the low-quality proposals with the guidance of the completeness pseudo labels, and 3) an instance-level rank consistency loss to achieve robust detection by leveraging the complementarity of RGB and FLOW modalities. Extensive experimental results on two challenging benchmarks including THUMOS14 and ActivityNet demonstrate the superior performance of our method.
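
For intuition, here is a minimal, hypothetical PyTorch sketch of the core P-MIL idea described in the abstract: classify candidate proposals directly, then pool the top-k proposal scores per class into a video-level prediction that can be trained with only video-level labels. All names, dimensions, and the top-k mean pooling are illustrative assumptions, not the authors' released implementation; the paper's full model additionally includes the surrounding contrastive feature extraction, proposal completeness evaluation, and instance-level rank consistency modules, which are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalMIL(nn.Module):
    """Sketch of proposal-based MIL: score candidate proposals directly,
    then MIL-pool top-k proposal scores into a video-level prediction."""

    def __init__(self, feat_dim=2048, num_classes=20, k=4):
        super().__init__()
        self.k = k
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes + 1),  # +1 for a background class
        )

    def forward(self, proposal_feats):
        # proposal_feats: (P, feat_dim) features of P candidate proposals
        logits = self.classifier(proposal_feats)      # (P, C+1)
        probs = logits.softmax(dim=-1)[:, :-1]        # (P, C), drop background
        k = min(self.k, probs.size(0))
        # MIL pooling: a class is present if its top-k proposals score high
        video_probs = probs.topk(k, dim=0).values.mean(dim=0)  # (C,)
        return probs, video_probs

# Toy usage: one untrimmed video with 50 candidate proposals
model = ProposalMIL()
feats = torch.randn(50, 2048)
video_label = torch.zeros(20)
video_label[3] = 1.0  # only a video-level category label is available
_, video_probs = model(feats)
loss = F.binary_cross_entropy(video_probs, video_label)
loss.backward()
```

Because the same per-proposal scores are reused at test time to rank and output detections, training and testing share one objective, which is exactly the train/test consistency gap with segment-based MIL that the paper targets.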

Results

The same eleven benchmark entries are repeated verbatim under five task tags (Video, Temporal Action Localization, Zero-Shot Learning, Action Localization, Weakly Supervised Action Localization); they are listed once below.

Dataset         | Metric            | Value | Model
THUMOS 2014     | mAP@0.1:0.5       | 57.4  | P-MIL
THUMOS 2014     | mAP@0.1:0.7       | 47    | P-MIL
THUMOS 2014     | mAP@0.5           | 40    | P-MIL
THUMOS14        | avg-mAP (0.1-0.5) | 57.4  | P-MIL
THUMOS14        | avg-mAP (0.1:0.7) | 47    | P-MIL
THUMOS14        | avg-mAP (0.3-0.7) | 38    | P-MIL
THUMOS’14       | mAP@0.5           | 40    | P-MIL
ActivityNet-1.3 | mAP@0.5           | 41.8  | P-MIL
ActivityNet-1.3 | mAP@0.5:0.95      | 25.5  | P-MIL
ActivityNet-1.2 | Mean mAP          | 26.5  | P-MIL
ActivityNet-1.2 | mAP@0.5           | 44.2  | P-MIL
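
On the metrics: mAP@t is mean average precision at a single temporal IoU threshold t, and avg-mAP (a:b) is the arithmetic mean of mAP over the thresholds from a to b (conventionally steps of 0.1 on THUMOS14 and 0.05 on ActivityNet's 0.5:0.95 range). A minimal sketch of that relationship; the per-threshold values below are synthetic placeholders, not the paper's reported results (only mAP@0.5 = 40 matches the table above):

```python
def avg_map(map_at, lo, hi, step):
    """Mean of per-threshold mAP values over IoU thresholds lo..hi (inclusive)."""
    n = int(round((hi - lo) / step)) + 1
    thresholds = [round(lo + i * step, 2) for i in range(n)]
    return sum(map_at[t] for t in thresholds) / n

# Synthetic per-threshold mAPs, chosen so the average comes out to the
# reported avg-mAP (0.1:0.7) = 47 while agreeing with mAP@0.5 = 40.
map_at = {0.1: 70.0, 0.2: 65.0, 0.3: 58.0, 0.4: 49.0, 0.5: 40.0, 0.6: 27.0, 0.7: 20.0}
print(avg_map(map_at, 0.1, 0.7, 0.1))  # -> 47.0
```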

Related Papers

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning (2025-07-09)
The Trilemma of Truth in Large Language Models (2025-06-30)
OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport (2025-06-25)
Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping (2025-06-23)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
MiCo: Multiple Instance Learning with Context-Aware Clustering for Whole Slide Image Analysis (2025-06-22)
HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis (2025-06-19)