
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

Humam Alwassel, Silvio Giancola, Bernard Ghanem

Published: 2020-11-23
Tasks: Action Classification, Action Localization, Temporal Action Proposal Generation, Video Captioning, Dense Video Captioning, Temporal Localization, Temporal Action Localization

Paper | PDF | Code (official)

Abstract

Due to the large memory footprint of untrimmed videos, current state-of-the-art video localization methods operate atop precomputed video clip features. These features are extracted from video encoders typically trained for trimmed action classification tasks, making such features not necessarily suitable for temporal localization. In this work, we propose a novel supervised pretraining paradigm for clip features that not only trains to classify activities but also considers background clips and global video information to improve temporal sensitivity. Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks: Temporal Action Localization, Action Proposal Generation, and Dense Video Captioning. We also show that our pretraining approach is effective across three encoder architectures and two pretraining datasets. We believe video feature encoding is an important building block for localization algorithms, and extracting temporally-sensitive features should be of paramount importance in building more accurate models. The code and pretrained models are available on our project website.
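To make the described pretraining paradigm concrete, below is a minimal PyTorch-style sketch of a training step that follows the abstract: every sampled clip (foreground and background) is classified as inside or outside an action using its local feature concatenated with a global video feature, while the action-class loss is applied to foreground clips only. This is not the official TSP implementation; the module and variable names (TSPStylePretrainer, gvf, region_head, etc.) and the choice of max-pooling for the global feature are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TSPStylePretrainer(nn.Module):
    """Sketch of temporally-sensitive pretraining (illustrative, not the official code)."""
    def __init__(self, clip_encoder: nn.Module, feat_dim: int, num_actions: int):
        super().__init__()
        self.encoder = clip_encoder                      # any clip-level video backbone
        self.action_head = nn.Linear(feat_dim, num_actions)
        # temporal-region head sees [local clip feature ; global video feature]
        self.region_head = nn.Linear(2 * feat_dim, 2)    # foreground vs. background

    def forward(self, clips, action_labels, region_labels):
        # clips: (num_clips, C, T, H, W), all sampled from one untrimmed video
        local_feats = self.encoder(clips)                # (num_clips, feat_dim)
        # global video feature: pool local features over the whole video (assumed max-pool)
        gvf = local_feats.max(dim=0, keepdim=True).values.expand_as(local_feats)
        region_logits = self.region_head(torch.cat([local_feats, gvf], dim=1))
        region_loss = F.cross_entropy(region_logits, region_labels)

        # action-class loss only on foreground clips (background clips labeled -1 here)
        fg = action_labels >= 0
        action_loss = (
            F.cross_entropy(self.action_head(local_feats[fg]), action_labels[fg])
            if fg.any() else local_feats.new_zeros(())
        )
        return action_loss + region_loss

Compared with pretraining on trimmed action classification alone, the extra region loss forces the encoder to separate background from foreground clips, which is what the abstract means by improved temporal sensitivity.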

Results

Task | Dataset | Metric | Value | Model
Temporal Action Localization | ActivityNet-1.3 | mAP | 35.81 | TSP
Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.5 | 51.26 | TSP
Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.75 | 37.12 | TSP
Temporal Action Localization | ActivityNet-1.3 | mAP IOU@0.95 | 9.29 | TSP
Temporal Action Localization | THUMOS'14 | Avg mAP (0.3:0.7) | 50.46 | TSP
Temporal Action Localization | THUMOS'14 | mAP IOU@0.1 | 74.02 | TSP
Temporal Action Localization | THUMOS'14 | mAP IOU@0.2 | 72.29 | TSP
Temporal Action Localization | THUMOS'14 | mAP IOU@0.3 | 69.1 | TSP
Temporal Action Localization | THUMOS'14 | mAP IOU@0.4 | 63.3 | TSP
Temporal Action Localization | THUMOS'14 | mAP IOU@0.5 | 53.5 | TSP
Temporal Action Localization | THUMOS'14 | mAP IOU@0.6 | 40.4 | TSP
Temporal Action Localization | THUMOS'14 | mAP IOU@0.7 | 26 | TSP
Temporal Action Localization | ActivityNet-1.3 | AR@100 | 76.63 | TSP
Temporal Action Localization | ActivityNet-1.3 | AUC (val) | 69.04 | TSP
Video Captioning | ActivityNet Captions | BLEU-3 | 4.16 | TSP
Video Captioning | ActivityNet Captions | BLEU-4 | 2.02 | TSP
Video Captioning | ActivityNet Captions | METEOR | 8.75 | TSP
Dense Video Captioning | ActivityNet Captions | BLEU-3 | 4.16 | TSP
Dense Video Captioning | ActivityNet Captions | BLEU-4 | 2.02 | TSP
Dense Video Captioning | ActivityNet Captions | METEOR | 8.75 | TSP
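For reference, the THUMOS'14 "Avg mAP (0.3:0.7)" entry is simply the mean of the per-IoU-threshold mAPs listed above, which can be checked with a couple of lines of Python (values copied from the table):

# Avg mAP (0.3:0.7) on THUMOS'14 is the mean of the per-threshold mAPs
map_per_iou = {0.3: 69.1, 0.4: 63.3, 0.5: 53.5, 0.6: 40.4, 0.7: 26.0}
avg_map = sum(map_per_iou.values()) / len(map_per_iou)
print(round(avg_map, 2))  # 50.46, matching the table entry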

Related Papers

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements (2025-06-11)
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks (2025-06-10)