Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


G-TAD: Sub-Graph Localization for Temporal Action Detection

Mengmeng Xu, Chen Zhao, David S. Rojas, Ali Thabet, Bernard Ghanem

Published: 2019-11-26 · CVPR 2020
Task: Temporal Action Localization
Links: Paper · PDF · Code (official implementation and several community implementations)

Abstract

Temporal action detection is a fundamental yet challenging task in video understanding. Video context is a critical cue to effectively detect actions, but current works mainly focus on temporal context, while neglecting semantic context as well as other important context properties. In this work, we propose a graph convolutional network (GCN) model to adaptively incorporate multi-level semantic context into video features and cast temporal action detection as a sub-graph localization problem. Specifically, we formulate video snippets as graph nodes, snippet-snippet correlations as edges, and actions associated with context as target sub-graphs. With graph convolution as the basic operation, we design a GCN block called GCNeXt, which learns the features of each node by aggregating its context and dynamically updates the edges in the graph. To localize each sub-graph, we also design an SGAlign layer to embed each sub-graph into the Euclidean space. Extensive experiments show that G-TAD is capable of finding effective video context without extra supervision and achieves state-of-the-art performance on two detection benchmarks. On ActivityNet-1.3, it obtains an average mAP of 34.09%; on THUMOS14, it reaches 51.6% at IoU@0.5 when combined with a proposal processing method. G-TAD code is publicly available at https://github.com/frostinassiky/gtad.
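The abstract describes GCNeXt as a block that aggregates each snippet's context from two kinds of edges: fixed temporal neighbours and dynamically recomputed semantic neighbours (nearest neighbours in feature space). The sketch below is an illustrative reconstruction of that idea only, not the authors' implementation; the function names, the simple mean aggregation, and the single-matrix projections are assumptions made for brevity (the paper's actual block uses grouped convolutions and learned edge features; see the official repository).

```python
import numpy as np

def knn_edges(x, k):
    """Semantic edges: connect each snippet to its k nearest
    neighbours in feature space (pairwise squared distances)."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-loops
    return np.argsort(d2, axis=1)[:, :k]  # (T, k) neighbour indices

def gcnext_block(x, w_temp, w_sem, k=3):
    """One GCNeXt-style block (illustrative): aggregate fixed temporal
    neighbours and dynamically chosen semantic neighbours, then mix
    both streams back into the node features with a residual."""
    # temporal stream: average of left/right snippets (zero-padded ends)
    padded = np.pad(x, ((1, 1), (0, 0)))
    temporal = 0.5 * (padded[:-2] + padded[2:])
    # semantic stream: mean over k-NN neighbours; the edges are
    # recomputed from the current features on every call, which is
    # what makes the graph "dynamic"
    idx = knn_edges(x, k)
    semantic = x[idx].mean(axis=1)
    # combine both streams with (here, hypothetical) learned projections
    return x + temporal @ w_temp + semantic @ w_sem
```

Stacking such blocks lets semantic context flow between distant snippets that look alike, while the temporal stream preserves local ordering, which is the intuition the abstract states for finding "effective video context without extra supervision".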

Results

The source page lists the same benchmark entries under several task tags (Temporal Action Localization, Action Localization, Zero-Shot Learning, and a truncated "Video" tag); the distinct results are:

Dataset            | Metric            | Value | Model
-------------------|-------------------|-------|----------------------
ActivityNet-1.3    | mAP (average)     | 34.09 | G-TAD
ActivityNet-1.3    | mAP IoU@0.5       | 50.36 | G-TAD
ActivityNet-1.3    | mAP IoU@0.75      | 34.6  | G-TAD
ActivityNet-1.3    | mAP IoU@0.95      |  9.02 | G-TAD
FineAction         | mAP (average)     |  9.06 | G-TAD (I3D features)
FineAction         | mAP IoU@0.5       | 13.74 | G-TAD (I3D features)
FineAction         | mAP IoU@0.75      |  8.83 | G-TAD (I3D features)
FineAction         | mAP IoU@0.95      |  3.06 | G-TAD (I3D features)
THUMOS'14          | mAP IoU@0.5       | 40.2  | G-TAD
EPIC-KITCHENS-100  | Avg mAP (0.1-0.5) |  9.4  | G-TAD (verb)
EPIC-KITCHENS-100  | mAP IoU@0.1       | 12.1  | G-TAD (verb)
EPIC-KITCHENS-100  | mAP IoU@0.2       | 11.0  | G-TAD (verb)
EPIC-KITCHENS-100  | mAP IoU@0.3       |  9.4  | G-TAD (verb)
EPIC-KITCHENS-100  | mAP IoU@0.4       |  8.1  | G-TAD (verb)
EPIC-KITCHENS-100  | mAP IoU@0.5       |  6.5  | G-TAD (verb)
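The "Avg mAP" rows are simply the mean of per-threshold mAP over the stated IoU range. For EPIC-KITCHENS-100 that range is 0.1 to 0.5 in steps of 0.1, so the reported 9.4 can be reproduced directly from the table; note this does not work for ActivityNet-1.3, where the average is taken over ten thresholds (0.5 to 0.95 in steps of 0.05), most of which are not tabulated here.

```python
# Average mAP over IoU thresholds, using the EPIC-KITCHENS-100
# (verb) values from the table above.
per_iou_map = {0.1: 12.1, 0.2: 11.0, 0.3: 9.4, 0.4: 8.1, 0.5: 6.5}
avg_map = sum(per_iou_map.values()) / len(per_iou_map)
print(round(avg_map, 1))  # 9.4, matching the reported "Avg mAP (0.1-0.5)"
```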

Related Papers

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Zero-Shot Temporal Interaction Localization for Egocentric Videos (2025-06-04)
A Review on Coarse to Fine-Grained Animal Action Recognition (2025-06-01)
CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization (2025-05-29)
DeepConvContext: A Multi-Scale Approach to Timeseries Classification in Human Activity Recognition (2025-05-27)
ProTAL: A Drag-and-Link Video Programming Framework for Temporal Action Localization (2025-05-23)
Are Spatial-Temporal Graph Convolution Networks for Human Action Recognition Over-Parameterized? (2025-05-15)