End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

Shuming Liu, Chen-Lin Zhang, Chen Zhao, Bernard Ghanem

2023-11-28 | CVPR 2024 | Action Detection | Temporal Action Localization

Abstract

Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited scales and limited data volumes can afford end-to-end training, which inevitably restricts TAD performance. In this paper, we reduce the memory consumption for end-to-end training, and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames, leading to significant detection performance. The key to our approach lies in our proposed temporal-informative adapter (TIA), which is a novel lightweight module that reduces training memory. Using TIA, we free the humongous backbone from learning to adapt to the TAD task by only updating the parameters in TIA. TIA also leads to better TAD representation by temporally aggregating context from adjacent frames throughout the backbone. We evaluate our model across four representative datasets. Owing to our efficient design, we are able to train end-to-end on VideoMAEv2-giant and achieve 75.4% mAP on THUMOS14, being the first end-to-end model to outperform the best feature-based methods. Code is available at https://github.com/sming256/AdaTAD.
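
The abstract describes TIA only at a high level: a lightweight residual module that aggregates context from adjacent frames while the backbone itself is not updated. The sketch below is a toy, pure-Python illustration of that idea, not the authors' implementation. The function name `temporal_adapter`, the shared smoothing `kernel`, and the `scale` gate are all assumptions made for this example; the official code is at https://github.com/sming256/AdaTAD.

```python
from typing import List

Frame = List[float]  # one feature vector per video frame


def temporal_adapter(frames: List[Frame], kernel: List[float], scale: float) -> List[Frame]:
    """Toy residual adapter that mixes each frame with its temporal neighbours.

    `frames` stands in for the output of a frozen backbone layer.
    `kernel` (odd length) and `scale` play the role of the only trainable
    parameters; both are hypothetical simplifications for illustration.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    num_frames = len(frames)
    out: List[Frame] = []
    for t, frame in enumerate(frames):
        mixed = [0.0] * len(frame)
        for k, weight in enumerate(kernel):
            src = t + k - half
            if 0 <= src < num_frames:  # zero padding at the clip boundaries
                for c, value in enumerate(frames[src]):
                    mixed[c] += weight * value
        # Residual connection: the backbone feature passes through unchanged
        # and the adapter branch adds temporally aggregated context.
        out.append([x + scale * m for x, m in zip(frame, mixed)])
    return out


features = [[1.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
print(temporal_adapter(features, kernel=[0.25, 0.5, 0.25], scale=1.0))
# prints [[2.0, 0.0], [4.25, 0.0], [6.5, 0.0]]
```

With `scale=0.0` the adapter returns the backbone features unchanged, which is one common way to make an added module start out as an identity mapping. Whether TIA is initialized this way is not stated in the abstract.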

Results

The same results are listed under each of four task labels: Video, Temporal Action Localization, Zero-Shot Learning, and Action Localization.

Dataset              Metric                Value    Model
ActivityNet-1.3      mAP                   41.93    AdaTAD (VideoMAEv2-giant)
ActivityNet-1.3      mAP IOU@0.5           61.72    AdaTAD (VideoMAEv2-giant)
ActivityNet-1.3      mAP IOU@0.75          43.35    AdaTAD (VideoMAEv2-giant)
ActivityNet-1.3      mAP IOU@0.95          10.85    AdaTAD (VideoMAEv2-giant)
THUMOS’14            Avg mAP (0.3:0.7)     76.9     AdaTAD (VideoMAEv2-giant)
THUMOS’14            mAP IOU@0.3           89.7     AdaTAD (VideoMAEv2-giant)
THUMOS’14            mAP IOU@0.4           86.7     AdaTAD (VideoMAEv2-giant)
THUMOS’14            mAP IOU@0.5           80.9     AdaTAD (VideoMAEv2-giant)
THUMOS’14            mAP IOU@0.6           71       AdaTAD (VideoMAEv2-giant)
THUMOS’14            mAP IOU@0.7           56.1     AdaTAD (VideoMAEv2-giant)
EPIC-KITCHENS-100    Avg mAP (0.1-0.5)     29.3     AdaTAD (verb, VideoMAE-L)
EPIC-KITCHENS-100    mAP IOU@0.1           33.1     AdaTAD (verb, VideoMAE-L)
EPIC-KITCHENS-100    mAP IOU@0.2           32.2     AdaTAD (verb, VideoMAE-L)
EPIC-KITCHENS-100    mAP IOU@0.3           30.4     AdaTAD (verb, VideoMAE-L)
EPIC-KITCHENS-100    mAP IOU@0.4           27.5     AdaTAD (verb, VideoMAE-L)
EPIC-KITCHENS-100    mAP IOU@0.5           23.1     AdaTAD (verb, VideoMAE-L)
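
The "Avg mAP" rows for THUMOS’14 and EPIC-KITCHENS-100 can be checked against the per-threshold rows. The sketch below assumes that the average is the plain arithmetic mean of the five listed IoU thresholds, rounded to one decimal. The helper name `average_map` is hypothetical. The ActivityNet-1.3 average is not recomputed because only three of its thresholds appear in the table.

```python
from statistics import mean
from typing import Dict

# Per-threshold mAP values copied from the results table (IoU threshold -> mAP).
THUMOS14: Dict[float, float] = {0.3: 89.7, 0.4: 86.7, 0.5: 80.9, 0.6: 71.0, 0.7: 56.1}
EPIC_KITCHENS_VERB: Dict[float, float] = {0.1: 33.1, 0.2: 32.2, 0.3: 30.4, 0.4: 27.5, 0.5: 23.1}


def average_map(per_threshold: Dict[float, float]) -> float:
    """Arithmetic mean of the per-threshold mAP values, rounded to one decimal."""
    return round(mean(per_threshold.values()), 1)


print(average_map(THUMOS14))            # prints 76.9
print(average_map(EPIC_KITCHENS_VERB))  # prints 29.3
```

Both values match the reported averages, which suggests the table rows are internally consistent.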

Related Papers

2025-07-16  DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition
2025-06-25  CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment
2025-06-25  MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans
2025-06-23  Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition
2025-06-17  Distributed Activity Detection for Cell-Free Hybrid Near-Far Field Communications
2025-06-04  Zero-Shot Temporal Interaction Localization for Egocentric Videos
2025-06-03  Speaker Diarization with Overlapping Community Detection Using Graph Attention Networks and Label Propagation Algorithm
2025-06-02  Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion