
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Temporal Action Localization with Enhanced Instant Discriminability

Dingfeng Shi, Qiong Cao, Yujie Zhong, Shan An, Jian Cheng, Haogang Zhu, DaCheng Tao

Date: 2023-09-11
Tasks: Action Detection, Action Localization, Temporal Action Localization

Abstract

Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video. The unclear boundaries of actions in videos often result in imprecise predictions of action boundaries by existing methods. To resolve this issue, we propose a one-stage framework named TriDet. First, we propose a Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. Then, we analyze the rank-loss problem (i.e. instant discriminability deterioration) in transformer-based methods and propose an efficient scalable-granularity perception (SGP) layer to mitigate this issue. To further push the limit of instant discriminability in the video backbone, we leverage the strong representation capability of pretrained large models and investigate their performance on TAD. Last, considering the adequate spatial-temporal context for classification, we design a decoupled feature pyramid network with separate feature pyramids to incorporate rich spatial context from the large model for localization. Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets, including hierarchical (multilabel) TAD datasets.
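The abstract describes the Trident-head as modelling a boundary through an estimated relative probability distribution around it. One common way to turn such a distribution into a single boundary position is to take its expected value. The sketch below illustrates that idea only. The bin layout, the softmax normalisation, and the function name `expected_boundary_offset` are assumptions made for illustration, not details taken from the paper or its official code.

```python
import math


def expected_boundary_offset(logits):
    """Return the expected bin index under a softmax over the given logits.

    Hypothetical helper: `logits[i]` is an unnormalised score that the
    boundary lies i bins away from a reference position.
    """
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Expected value of the bin index under the estimated distribution.
    return sum(i * p for i, p in enumerate(probs))


# A distribution that is symmetric around bin 1 has expected offset 1.
symmetric_offset = expected_boundary_offset([0.0, 2.0, 0.0])
# Uniform scores over five bins have expected offset 2 (the middle bin).
uniform_offset = expected_boundary_offset([0.0, 0.0, 0.0, 0.0, 0.0])
```

Under these assumptions, a sharply peaked distribution yields an offset close to the peak bin, while a flat one (an unclear boundary) yields an offset near the middle of the range.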

Results

The same results are listed under four task categories: Video, Temporal Action Localization, Zero-Shot Learning, and Action Localization.

| Dataset | Metric | Value | Model |
|---|---|---|---|
| HACS | Average-mAP | 43.1 | TriDet (VideoMAEv2) |
| HACS | mAP@0.5 | 62.4 | TriDet (VideoMAEv2) |
| HACS | mAP@0.75 | 44.1 | TriDet (VideoMAEv2) |
| HACS | mAP@0.95 | 13.1 | TriDet (VideoMAEv2) |
| THUMOS’14 | Avg mAP (0.3:0.7) | 70.1 | TriDet (VideoMAE v2-g feature) |
| THUMOS’14 | mAP IOU@0.3 | 84.8 | TriDet (VideoMAE v2-g feature) |
| THUMOS’14 | mAP IOU@0.4 | 80 | TriDet (VideoMAE v2-g feature) |
| THUMOS’14 | mAP IOU@0.5 | 73.3 | TriDet (VideoMAE v2-g feature) |
| THUMOS’14 | mAP IOU@0.6 | 63.8 | TriDet (VideoMAE v2-g feature) |
| THUMOS’14 | mAP IOU@0.7 | 48.8 | TriDet (VideoMAE v2-g feature) |
| MultiTHUMOS | Average mAP | 37.5 | TriDet (VideoMAEv2) |
| MultiTHUMOS | mAP IOU@0.2 | 57.7 | TriDet (VideoMAEv2) |
| MultiTHUMOS | mAP IOU@0.5 | 42.7 | TriDet (VideoMAEv2) |
| MultiTHUMOS | mAP IOU@0.7 | 24.3 | TriDet (VideoMAEv2) |
| MultiTHUMOS | Average mAP | 30.7 | TriDet (I3D-rgb) |
| MultiTHUMOS | mAP IOU@0.2 | 49.1 | TriDet (I3D-rgb) |
| MultiTHUMOS | mAP IOU@0.5 | 34.3 | TriDet (I3D-rgb) |
| MultiTHUMOS | mAP IOU@0.7 | 17.8 | TriDet (I3D-rgb) |
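
The metrics above are mean average precision (mAP) at fixed temporal IoU thresholds, plus an average over a threshold range. As a rough sketch of how these numbers relate, the snippet below computes temporal IoU between two segments and checks that the THUMOS’14 average equals the mean of its five listed per-threshold values. The function name `temporal_iou` and the `(start, end)` segment representation are assumptions for illustration. The HACS and MultiTHUMOS averages cannot be checked this way, because only some of their thresholds are listed.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in the same time unit."""
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Per-threshold mAP values reported for THUMOS'14 in the results table.
thumos14_map = {0.3: 84.8, 0.4: 80.0, 0.5: 73.3, 0.6: 63.8, 0.7: 48.8}

# Average mAP over thresholds 0.3:0.7 is the plain mean of the five values.
thumos14_avg = sum(thumos14_map.values()) / len(thumos14_map)

# Example: a 10-second prediction shifted by 5 seconds from the ground truth.
example_iou = temporal_iou((0.0, 10.0), (5.0, 15.0))
```

The mean comes out at 70.14, which matches the reported 70.1 after rounding. In the example, the shifted prediction overlaps the ground truth for 5 of 15 seconds, so its temporal IoU is one third, and it would count as a correct detection at threshold 0.3 but not at 0.4 or above.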

Related Papers

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment (2025-06-25)
MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Distributed Activity Detection for Cell-Free Hybrid Near-Far Field Communications (2025-06-17)
Zero-Shot Temporal Interaction Localization for Egocentric Videos (2025-06-04)
Speaker Diarization with Overlapping Community Detection Using Graph Attention Networks and Label Propagation Algorithm (2025-06-03)
Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion (2025-06-02)