
Temporal Context Aggregation Network for Temporal Action Proposal Refinement

Zhiwu Qing, Haisheng Su, Weihao Gan, Dongliang Wang, Wei Wu, Xiang Wang, Yu Qiao, Junjie Yan, Changxin Gao, Nong Sang

Published: 2021-03-24
Venue: CVPR 2021
Tasks: Action Detection, Action Localization, Temporal Action Proposal Generation, Video Understanding, Retrieval, Temporal Action Localization

Abstract

Temporal action proposal generation aims to estimate the temporal intervals of actions in untrimmed videos, a challenging yet important task in video understanding. The proposals generated by current methods still suffer from inaccurate temporal boundaries and unreliable confidence scores for retrieval, owing to the lack of efficient temporal modeling and effective use of boundary context. In this paper, we propose the Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals through "local and global" temporal context aggregation and complementary, progressive boundary refinement. Specifically, we first design a Local-Global Temporal Encoder (LGTE), which adopts a channel grouping strategy to efficiently encode both "local and global" temporal inter-dependencies. Furthermore, the boundary context and the internal context of proposals are used for frame-level and segment-level boundary regression, respectively. A Temporal Boundary Regressor (TBR) is designed to combine these two regression granularities in an end-to-end fashion, achieving precise boundaries and reliable proposal confidence through progressive refinement. Extensive experiments are conducted on three challenging datasets: HACS, ActivityNet-v1.3, and THUMOS-14, where TCANet generates proposals with high precision and recall. Combined with an existing action classifier, TCANet obtains remarkable temporal action detection performance compared with other methods. The proposed TCANet won 1st place on the CVPR 2020 HACS challenge leaderboard for the temporal action localization task.
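The channel grouping idea behind the LGTE can be illustrated with a small sketch. The abstract does not say how each group encodes its context, so everything below is an assumption made for illustration: the function names `attend` and `local_global_encode`, the use of plain dot-product self-attention, the window size, and the split between local and global groups are all hypothetical and are not taken from the authors' code.

```python
import math


def attend(features, window=None):
    """Self-attention over time for one channel group (illustrative only).

    features: list of T frame vectors, each a list of floats.
    window:   if set, frame t attends only to frames within +/- window
              (local context); if None, it attends to all frames (global).
    """
    T = len(features)
    dim = len(features[0])
    out = []
    for t in range(T):
        if window is None:
            idx = list(range(T))
        else:
            idx = list(range(max(0, t - window), min(T, t + window + 1)))
        # Scaled dot-product scores; query, key and value are the raw
        # features here (no learned projections in this sketch).
        scores = [
            sum(a * b for a, b in zip(features[t], features[s])) / math.sqrt(dim)
            for s in idx
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]  # stable softmax
        total = sum(weights)
        out.append([
            sum(w * features[s][c] for w, s in zip(weights, idx)) / total
            for c in range(dim)
        ])
    return out


def local_global_encode(features, num_groups=4, num_local=2, window=1):
    """Split channels into groups; the first num_local groups see a local
    window, the remaining groups see the whole sequence."""
    T = len(features)
    C = len(features[0])
    assert C % num_groups == 0, "channels must divide evenly into groups"
    g = C // num_groups
    encoded = [[] for _ in range(T)]
    for k in range(num_groups):
        group = [frame[k * g:(k + 1) * g] for frame in features]
        ctx = attend(group, window if k < num_local else None)
        for t in range(T):
            encoded[t].extend(ctx[t])  # concatenate groups back per frame
    return encoded
```

Because each group attends over only its own slice of the channels, local and global context are computed side by side at roughly the cost of a single full-width pass, which is one plausible reading of "efficiently" in the abstract.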

Results

Task | Dataset | Metric | Value | Model
Video | ActivityNet-1.3 | mAP | 37.56 | TCANet (SlowFast R101)
Video | ActivityNet-1.3 | mAP IOU@0.5 | 54.33 | TCANet (SlowFast R101)
Video | ActivityNet-1.3 | mAP IOU@0.75 | 39.13 | TCANet (SlowFast R101)
Video | ActivityNet-1.3 | mAP IOU@0.95 | 8.41 | TCANet (SlowFast R101)
Temporal Action Localization | ActivityNet-1.3 | mAP | 37.56 | TCANet (SlowFast R101)
Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.5 | 54.33 | TCANet (SlowFast R101)
Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.75 | 39.13 | TCANet (SlowFast R101)
Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.95 | 8.41 | TCANet (SlowFast R101)
Zero-Shot Learning | ActivityNet-1.3 | mAP | 37.56 | TCANet (SlowFast R101)
Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.5 | 54.33 | TCANet (SlowFast R101)
Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.75 | 39.13 | TCANet (SlowFast R101)
Zero-Shot Learning | ActivityNet-1.3 | mAP IOU@0.95 | 8.41 | TCANet (SlowFast R101)
Action Localization | ActivityNet-1.3 | mAP | 37.56 | TCANet (SlowFast R101)
Action Localization | ActivityNet-1.3 | mAP IOU@0.5 | 54.33 | TCANet (SlowFast R101)
Action Localization | ActivityNet-1.3 | mAP IOU@0.75 | 39.13 | TCANet (SlowFast R101)
Action Localization | ActivityNet-1.3 | mAP IOU@0.95 | 8.41 | TCANet (SlowFast R101)
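
The IOU@0.5, IOU@0.75 and IOU@0.95 metrics in the table depend on temporal intersection-over-union (tIoU) between a predicted segment and a ground-truth segment. The sketch below shows that overlap test only; it is not the official evaluation code, and the function names are hypothetical. By ActivityNet convention the unqualified mAP is usually the average over tIoU thresholds from 0.5 to 0.95 in steps of 0.05, although this page does not state that explicitly.

```python
THRESHOLDS = (0.5, 0.75, 0.95)  # thresholds reported in the table


def temporal_iou(pred, gt):
    """tIoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def matches(pred, gt):
    """For each threshold, report whether pred counts as a hit on gt."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in THRESHOLDS}
```

For example, a prediction of (10.0, 20.0) against a ground truth of (12.0, 20.0) overlaps for 8 seconds out of a 10-second union, so its tIoU is 0.8: a hit at 0.5 and 0.75 but a miss at 0.95. This is why scores drop sharply at the strictest threshold.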

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)