Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning

Zhekun Luo, Devin Guillory, Baifeng Shi, Wei Ke, Fang Wan, Trevor Darrell, Huijuan Xu

2020-03-31 · ECCV 2020 · Tasks: Weakly Supervised Action Localization, Action Localization, Multiple Instance Learning
Paper · PDF · Code (official)

Abstract

Weakly-supervised action localization requires training a model to localize the action segments in a video given only the video-level action label. It can be solved under the Multiple Instance Learning (MIL) framework, where a bag (video) contains multiple instances (action segments). Since only the bag's label is known, the main challenge is identifying which key instances within the bag trigger the bag's label. Most previous models take attention-based approaches, applying attention to generate the bag's representation from its instances and then training it via the bag's classification. These models, however, implicitly violate the MIL assumption that instances in negative bags should be uniformly negative. In this work, we explicitly model the key instance assignment as a hidden variable and adopt an Expectation-Maximization (EM) framework. We derive two pseudo-label generation schemes to model the E and M steps and iteratively optimize the likelihood lower bound. We show that our EM-MIL approach more accurately models both the learning objective and the MIL assumptions. It achieves state-of-the-art performance on two standard benchmarks, THUMOS14 and ActivityNet1.2.
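The EM alternation the abstract describes can be sketched in a few lines: treat the key-instance assignment as a hidden variable, pseudo-label segments in the E-step, and take a gradient step on a segment-level classifier in the M-step. The sketch below is a toy illustration on synthetic features, not the paper's implementation; the linear model, the above-average thresholding rule, and the max-pooled bag score are all assumptions made for brevity.

```python
import numpy as np

# Toy EM-style MIL loop: the key-instance assignment q is the hidden
# variable; the E-step pseudo-labels segments, the M-step fits a
# segment-level classifier to those pseudo-labels. The linear model,
# thresholding rule, and synthetic data are illustrative assumptions,
# not the paper's implementation.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

T, D = 20, 8                      # segments per video (bag), feature dim
w_true = rng.normal(size=D)

def make_bag(positive):
    X = rng.normal(size=(T, D))
    if positive:
        X[:4] += 3.0 * w_true     # a few key segments carry the action
    return X

bags = [make_bag(p) for p in (True, False, True, False)]
labels = np.array([1, 0, 1, 0])   # only video-level labels are known

w = np.zeros(D)                   # segment-level classifier
for _ in range(100):
    grad = np.zeros(D)
    for X, y in zip(bags, labels):
        s = sigmoid(X @ w)
        # E-step: positive bags pseudo-label their above-average segments 1;
        # negative bags pseudo-label every segment 0 (the MIL assumption).
        q = (s >= s.mean()).astype(float) if y == 1 else np.zeros(T)
        # M-step: one gradient step toward the pseudo-labels, raising the
        # likelihood lower bound.
        grad += X.T @ (q - s) / T
    w += 0.5 * grad / len(bags)

# A bag's score is its best segment's score (max-pooling over instances).
bag_scores = np.array([sigmoid(X @ w).max() for X in bags])
```

After training, positive bags should score higher than negative ones, and the same per-segment scores `s` are what would localize the action in time.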

Results

Task                                  | Dataset  | Metric            | Value | Model
Video                                 | THUMOS14 | avg-mAP (0.1:0.5) | 44.9  | EM-MIL
Video                                 | THUMOS14 | avg-mAP (0.1:0.7) | 37.7  | EM-MIL
Video                                 | THUMOS14 | avg-mAP (0.3:0.7) | 30.4  | EM-MIL
Video                                 | THUMOS14 | mAP@0.5           | 30.5  | EM-MIL
Temporal Action Localization          | THUMOS14 | avg-mAP (0.1:0.5) | 44.9  | EM-MIL
Temporal Action Localization          | THUMOS14 | avg-mAP (0.1:0.7) | 37.7  | EM-MIL
Temporal Action Localization          | THUMOS14 | avg-mAP (0.3:0.7) | 30.4  | EM-MIL
Temporal Action Localization          | THUMOS14 | mAP@0.5           | 30.5  | EM-MIL
Zero-Shot Learning                    | THUMOS14 | avg-mAP (0.1:0.5) | 44.9  | EM-MIL
Zero-Shot Learning                    | THUMOS14 | avg-mAP (0.1:0.7) | 37.7  | EM-MIL
Zero-Shot Learning                    | THUMOS14 | avg-mAP (0.3:0.7) | 30.4  | EM-MIL
Zero-Shot Learning                    | THUMOS14 | mAP@0.5           | 30.5  | EM-MIL
Action Localization                   | THUMOS14 | avg-mAP (0.1:0.5) | 44.9  | EM-MIL
Action Localization                   | THUMOS14 | avg-mAP (0.1:0.7) | 37.7  | EM-MIL
Action Localization                   | THUMOS14 | avg-mAP (0.3:0.7) | 30.4  | EM-MIL
Action Localization                   | THUMOS14 | mAP@0.5           | 30.5  | EM-MIL
Weakly Supervised Action Localization | THUMOS14 | avg-mAP (0.1:0.5) | 44.9  | EM-MIL
Weakly Supervised Action Localization | THUMOS14 | avg-mAP (0.1:0.7) | 37.7  | EM-MIL
Weakly Supervised Action Localization | THUMOS14 | avg-mAP (0.3:0.7) | 30.4  | EM-MIL
Weakly Supervised Action Localization | THUMOS14 | mAP@0.5           | 30.5  | EM-MIL
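For reference, the avg-mAP metrics above average per-threshold mAP over a range of temporal IoU thresholds in steps of 0.1 (the usual THUMOS14 convention, e.g. 0.1 through 0.5 for avg-mAP (0.1:0.5)). A minimal sketch of the computation, using made-up per-threshold values rather than the paper's numbers:

```python
# avg-mAP averages mAP over a range of temporal IoU thresholds in steps
# of 0.1. The mAP values below are made up for illustration, not the
# paper's results.

def avg_map(map_by_iou, lo, hi):
    """Mean of mAP over thresholds t with lo <= t <= hi."""
    ts = [t for t in sorted(map_by_iou) if lo - 1e-9 <= t <= hi + 1e-9]
    return sum(map_by_iou[t] for t in ts) / len(ts)

map_by_iou = {0.1: 0.60, 0.2: 0.55, 0.3: 0.48,
              0.4: 0.40, 0.5: 0.31, 0.6: 0.22, 0.7: 0.14}

print(avg_map(map_by_iou, 0.1, 0.5))  # mean of the five thresholds 0.1..0.5
```

This is why the three avg-mAP columns differ for the same model: each summarizes the same per-threshold curve over a different IoU range.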

Related Papers

GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning (2025-07-09)
The Trilemma of Truth in Large Language Models (2025-06-30)
OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport (2025-06-25)
Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping (2025-06-23)
MiCo: Multiple Instance Learning with Context-Aware Clustering for Whole Slide Image Analysis (2025-06-22)
HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis (2025-06-19)
Dual-detector Re-optimization for Federated Weakly Supervised Video Anomaly Detection Via Adaptive Dynamic Recursive Mapping (2025-06-13)
BioLangFusion: Multimodal Fusion of DNA, mRNA, and Protein Language Models (2025-06-10)