Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Weakly-Supervised Action Localization by Generative Attention Modeling

Baifeng Shi, Qi Dai, Yadong Mu, Jingdong Wang

Published: 2020-03-27 · CVPR 2020
Tasks: Weakly Supervised Action Localization · Action Localization · Weakly-supervised Temporal Action Localization · Temporal Action Localization
Links: Paper · PDF · Code (official)

Abstract

Weakly-supervised temporal action localization is the problem of learning an action localization model with only video-level action labels available. The general framework largely relies on classification activation: an attention model identifies the action-related frames, which are then categorized into different classes. Such a method suffers from the action-context confusion issue: context frames near action clips tend to be recognized as action frames themselves, since they are closely related to the specific classes. To solve this problem, in this paper we propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE). Based on the observation that context exhibits a notable difference from action at the representation level, a probabilistic model, i.e., a conditional VAE, is learned to model the likelihood of each frame given its attention. By maximizing this conditional probability with respect to the attention, action and non-action frames are well separated. Experiments on THUMOS14 and ActivityNet1.2 demonstrate the advantage of our method and its effectiveness in handling the action-context confusion problem. Code is available on GitHub.
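The core idea in the abstract, modeling a class-agnostic per-frame likelihood conditioned on attention and then choosing the attention that maximizes it, can be illustrated with a toy sketch. This is not the authors' implementation: simple Gaussians stand in for the conditional VAE decoder, the attention is binarized, and all data here is synthetic.

```python
import numpy as np

# Toy sketch (NOT the paper's DGAM code): model p(x_t | lambda_t) with two
# class-agnostic densities -- one for action frames (lambda = 1) and one for
# context frames (lambda = 0) -- then pick, per frame, the attention value
# that maximizes the frame's likelihood. Gaussians stand in for the CVAE.

rng = np.random.default_rng(0)
D = 8
mu_action, mu_context = np.ones(D), -np.ones(D)  # assumed representation gap

def log_gauss(x, mu, sigma=1.0):
    # log-density of an isotropic Gaussian, up to an additive constant
    return -0.5 * np.sum((x - mu) ** 2) / sigma**2

# Synthetic "video": 5 action frames followed by 5 context frames
frames = np.vstack([mu_action + 0.3 * rng.standard_normal((5, D)),
                    mu_context + 0.3 * rng.standard_normal((5, D))])

# Maximize log p(x_t | lambda_t) over lambda_t in {0, 1} for each frame
attention = np.array([
    1.0 if log_gauss(x, mu_action) > log_gauss(x, mu_context) else 0.0
    for x in frames
])
print(attention)  # action frames get lambda=1, context frames lambda=0
```

Because context and action differ at the representation level (the paper's key observation), maximizing the attention-conditioned likelihood cleanly separates the two groups in this toy setup.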

Results

Task | Dataset | Metric | Value | Model
Video | THUMOS 2014 | mAP@0.1:0.5 | 45.6 | DGAM
Video | THUMOS 2014 | mAP@0.1:0.7 | 37 | DGAM
Video | THUMOS 2014 | mAP@0.5 | 28.8 | DGAM
Video | ActivityNet-1.2 | Mean mAP | 24.4 | DGAM
Video | ActivityNet-1.2 | mAP@0.5 | 41 | DGAM
Temporal Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 45.6 | DGAM
Temporal Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 37 | DGAM
Temporal Action Localization | THUMOS 2014 | mAP@0.5 | 28.8 | DGAM
Temporal Action Localization | ActivityNet-1.2 | Mean mAP | 24.4 | DGAM
Temporal Action Localization | ActivityNet-1.2 | mAP@0.5 | 41 | DGAM
Zero-Shot Learning | THUMOS 2014 | mAP@0.1:0.5 | 45.6 | DGAM
Zero-Shot Learning | THUMOS 2014 | mAP@0.1:0.7 | 37 | DGAM
Zero-Shot Learning | THUMOS 2014 | mAP@0.5 | 28.8 | DGAM
Zero-Shot Learning | ActivityNet-1.2 | Mean mAP | 24.4 | DGAM
Zero-Shot Learning | ActivityNet-1.2 | mAP@0.5 | 41 | DGAM
Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 45.6 | DGAM
Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 37 | DGAM
Action Localization | THUMOS 2014 | mAP@0.5 | 28.8 | DGAM
Action Localization | ActivityNet-1.2 | Mean mAP | 24.4 | DGAM
Action Localization | ActivityNet-1.2 | mAP@0.5 | 41 | DGAM
Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.1:0.5 | 45.6 | DGAM
Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.1:0.7 | 37 | DGAM
Weakly Supervised Action Localization | THUMOS 2014 | mAP@0.5 | 28.8 | DGAM
Weakly Supervised Action Localization | ActivityNet-1.2 | Mean mAP | 24.4 | DGAM
Weakly Supervised Action Localization | ActivityNet-1.2 | mAP@0.5 | 41 | DGAM
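A note on the averaged metrics above: on THUMOS'14, a metric like mAP@0.1:0.5 conventionally denotes the mean of per-threshold mAP values over IoU thresholds 0.1 to 0.5 in steps of 0.1. The sketch below shows the arithmetic with hypothetical per-threshold numbers (only the 28.8 at IoU 0.5 comes from the table; the rest are made up for illustration).

```python
# Illustration of how an averaged metric such as mAP@0.1:0.5 is typically
# computed. Per-threshold values below are ASSUMED for illustration, except
# mAP@0.5 = 28.8, which appears in the results table above.
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
map_at = {0.1: 60.0, 0.2: 54.0, 0.3: 46.5, 0.4: 38.2, 0.5: 28.8}

avg = sum(map_at[t] for t in thresholds) / len(thresholds)
print(round(avg, 1))  # -> 45.5 (mean mAP over IoU 0.1:0.5)
```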

Related Papers

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Zero-Shot Temporal Interaction Localization for Egocentric Videos (2025-06-04)
A Review on Coarse to Fine-Grained Animal Action Recognition (2025-06-01)
LLM-powered Query Expansion for Enhancing Boundary Prediction in Language-driven Action Localization (2025-05-30)
CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization (2025-05-29)
DeepConvContext: A Multi-Scale Approach to Timeseries Classification in Human Activity Recognition (2025-05-27)
ProTAL: A Drag-and-Link Video Programming Framework for Temporal Action Localization (2025-05-23)