PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points

Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, Limin Wang

Published: 2022-10-20 · Tasks: Action Detection, Temporal Action Localization
Links: Paper · PDF · Code (official)

Abstract

Traditional temporal action detection (TAD) usually handles untrimmed videos containing a small number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this setting can be unrealistic, as different classes of actions often co-occur in practice. In this paper, we focus on the task of multi-label temporal action detection, which aims to localize all action instances in a multi-label untrimmed video. Multi-label TAD is more challenging, as it requires fine-grained class discrimination within a single video and precise localization of co-occurring instances. To address this challenge, we extend the sparse query-based detection paradigm from traditional TAD and propose PointTAD, a framework for multi-label TAD. Specifically, PointTAD introduces a small set of learnable query points to represent the important frames of each action instance. This point-based representation provides a flexible mechanism to localize the discriminative frames at action boundaries as well as the important frames inside the action. Moreover, we perform the action decoding process with a Multi-level Interactive Module to capture both point-level and instance-level action semantics. Finally, PointTAD employs an end-to-end trainable framework based solely on RGB input for easy deployment. We evaluate our method on two popular benchmarks and introduce the new detection-mAP metric for multi-label TAD. Our model outperforms all previous methods by a large margin under the detection-mAP metric and also achieves promising results under the segmentation-mAP metric. Code is available at https://github.com/MCG-NJU/PointTAD.
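
To make the point-based representation concrete, below is a minimal sketch, not the official implementation: the class and parameter names (QueryPointDecoder, num_points) are hypothetical, and it only assumes that learnable points, normalized to [0, 1], sample frame features by linear interpolation and that a segment is read off the extreme points. See the official repository for how PointTAD actually decodes queries.

import torch
import torch.nn as nn

class QueryPointDecoder(nn.Module):
    # Hypothetical sketch of a point-based action query: learnable points
    # in [0, 1] sample per-frame features; the segment comes from the
    # extreme points. The official PointTAD code differs in detail.
    def __init__(self, feat_dim: int, num_points: int = 8):
        super().__init__()
        self.points = nn.Parameter(torch.rand(num_points))  # learnable query points
        self.proj = nn.Linear(feat_dim, feat_dim)           # point-level projection

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (T, C) RGB frame features of one untrimmed video.
        T = frame_feats.shape[0]
        pts = self.points.clamp(0, 1) * (T - 1)             # frame coordinates
        lo, hi = pts.floor().long(), pts.ceil().long()      # neighboring frames
        w = (pts - lo.float()).unsqueeze(-1)                # interpolation weights
        point_feats = (1 - w) * frame_feats[lo] + w * frame_feats[hi]  # (P, C)
        point_feats = self.proj(point_feats)
        # Segment boundaries from the extreme points, normalized to [0, 1].
        segment = torch.stack([pts.min(), pts.max()]) / max(T - 1, 1)
        return point_feats, segment

# Usage: 8 query points over a 128-frame clip with 256-d features.
decoder = QueryPointDecoder(feat_dim=256, num_points=8)
feats, seg = decoder(torch.randn(128, 256))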

Results

All results are for PointTAD on MultiTHUMOS. The archive indexes the same numbers under four task tags (Video, Temporal Action Localization, Zero-Shot Learning, Action Localization); they are shown once below.

Metric         Value   Model
Average mAP    23.5    PointTAD
mAP IoU@0.1    42.3    PointTAD
mAP IoU@0.2    39.7    PointTAD
mAP IoU@0.3    35.8    PointTAD
mAP IoU@0.4    30.9    PointTAD
mAP IoU@0.5    24.9    PointTAD
mAP IoU@0.6    18.5    PointTAD
mAP IoU@0.7    12.0    PointTAD
mAP IoU@0.8    5.6     PointTAD
mAP IoU@0.9    1.4     PointTAD
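
As a quick consistency check on the table (an editorial sanity check, not code from the paper): the Average mAP row matches the mean of the per-threshold values over IoU 0.1 through 0.9, and temporal IoU is assumed here to follow the standard segment-overlap definition.

# Temporal IoU between two segments (start, end); standard definition, assumed.
def tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Average mAP as the mean of mAP over IoU thresholds 0.1-0.9 (values from the table).
per_threshold = [42.3, 39.7, 35.8, 30.9, 24.9, 18.5, 12.0, 5.6, 1.4]
print(round(sum(per_threshold) / len(per_threshold), 1))  # 23.5, matching the Average mAP row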

Related Papers

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment (2025-06-25)
MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Distributed Activity Detection for Cell-Free Hybrid Near-Far Field Communications (2025-06-17)
Zero-Shot Temporal Interaction Localization for Egocentric Videos (2025-06-04)
Speaker Diarization with Overlapping Community Detection Using Graph Attention Networks and Label Propagation Algorithm (2025-06-03)
Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion (2025-06-02)