Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Multi-shot Temporal Event Localization: a Benchmark

Xiaolong Liu, Yao Hu, Song Bai, Fei Ding, Xiang Bai, Philip H. S. Torr

Published 2020-12-17 · CVPR 2021
Tasks: Action Localization, Temporal Action Localization
Links: Paper · PDF · Code (official)

Abstract

Current developments in temporal event or action localization usually target actions captured by a single camera. However, extensive events or actions in the wild may be captured as a sequence of shots by multiple cameras at different positions. In this paper, we propose a new and challenging task called multi-shot temporal event localization, and accordingly, collect a large-scale dataset called MUlti-Shot EventS (MUSES). MUSES has 31,477 event instances for a total of 716 video hours. The core characteristic of MUSES is its frequent shot cuts, with an average of 19 shots per instance and 176 shots per video, which induces large intra-instance variations. Our comprehensive evaluations show that the state-of-the-art method in temporal action localization only achieves an mAP of 13.1% at IoU=0.5. As a minor contribution, we present a simple baseline approach for handling the intra-instance variations, which reports an mAP of 18.9% on MUSES and 56.9% on THUMOS14 at IoU=0.5. To facilitate research in this direction, we release the dataset and the project code at https://songbai.site/muses/.
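
The headline numbers (mAP at IoU=0.5 here, and the per-threshold and averaged mAPs in the results below) follow the standard temporal-action-localization protocol: a predicted segment counts as a true positive when its temporal IoU with an as-yet-unmatched ground-truth instance of the same class meets the threshold, and AP is computed per class from the score-ranked predictions. The sketch below illustrates that criterion; the function names and the greedy-matching details are illustrative assumptions, not taken from the released MUSES evaluation code.

```python
# Minimal sketch of the temporal-IoU matching behind mAP@IoU metrics in
# temporal action localization benchmarks such as MUSES and THUMOS'14.
# Names and matching details are illustrative, not the official evaluation.

def temporal_iou(pred, gt):
    """IoU between two 1-D segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision_at_iou(preds, gts, iou_thr=0.5):
    """preds: list of (start, end, score) for one class; gts: list of (start, end).
    A prediction is a true positive if its best-overlapping ground truth is
    unmatched and has IoU >= iou_thr (greedy matching in descending score)."""
    preds = sorted(preds, key=lambda p: p[2], reverse=True)
    matched = [False] * len(gts)
    tp, fp, precisions = 0, 0, []
    for start, end, _score in preds:
        ious = [temporal_iou((start, end), g) for g in gts]
        best = max(range(len(gts)), key=lambda i: ious[i], default=None)
        if best is not None and ious[best] >= iou_thr and not matched[best]:
            matched[best] = True
            tp += 1
            precisions.append(tp / (tp + fp))
        else:
            fp += 1
    # Non-interpolated AP: precision summed at each true-positive recall step.
    return sum(precisions) / len(gts) if gts else 0.0

# Tiny usage example (hypothetical segments, in seconds):
preds = [(2.0, 7.5, 0.9), (10.0, 12.0, 0.4)]
gts = [(2.5, 8.0)]
print(average_precision_at_iou(preds, gts, iou_thr=0.5))  # 1.0
```

mAP at a given IoU threshold is then the mean of the per-class APs, and "Avg mAP (0.3:0.7)" in the results below averages that over the thresholds 0.3, 0.4, 0.5, 0.6, 0.7.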

Results

The same figures are listed under four task pages (Video, Temporal Action Localization, Zero-Shot Learning, Action Localization); the distinct entries are:

Dataset | Metric | Value | Model
THUMOS’14 | Avg mAP (0.3:0.7) | 53.4 | MUSES
THUMOS’14 | mAP IoU@0.3 | 68.9 | MUSES
THUMOS’14 | mAP IoU@0.4 | 64.0 | MUSES
THUMOS’14 | mAP IoU@0.5 | 56.9 | MUSES
THUMOS’14 | mAP IoU@0.6 | 46.3 | MUSES
THUMOS’14 | mAP IoU@0.7 | 31.0 | MUSES
MUSES | mAP | 18.6 | MUSES
MUSES | mAP@0.3 | 25.9 | MUSES
MUSES | mAP@0.4 | 22.6 | MUSES
MUSES | mAP@0.5 | 18.9 | MUSES
MUSES | mAP@0.6 | 15.0 | MUSES
MUSES | mAP@0.7 | 10.6 | MUSES
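
Assuming the usual convention that the averaged mAP is the unweighted mean of the per-threshold values (an assumption; this page does not state it), the table is internally consistent, as the quick check below shows. It reproduces 53.4 for the THUMOS’14 "Avg mAP (0.3:0.7)" row and 18.6 for the MUSES "mAP" row.

```python
# Sanity check: averaged mAP as the unweighted mean of the per-threshold
# mAPs reported in the table above (standard convention; assumed here).
thumos14 = [68.9, 64.0, 56.9, 46.3, 31.0]  # mAP at IoU 0.3, 0.4, 0.5, 0.6, 0.7
muses    = [25.9, 22.6, 18.9, 15.0, 10.6]

print(round(sum(thumos14) / len(thumos14), 1))  # 53.4 -> matches Avg mAP (0.3:0.7)
print(round(sum(muses) / len(muses), 1))        # 18.6 -> matches the MUSES "mAP" row
```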

Related Papers

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Zero-Shot Temporal Interaction Localization for Egocentric Videos (2025-06-04)
A Review on Coarse to Fine-Grained Animal Action Recognition (2025-06-01)
LLM-powered Query Expansion for Enhancing Boundary Prediction in Language-driven Action Localization (2025-05-30)
CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization (2025-05-29)
DeepConvContext: A Multi-Scale Approach to Timeseries Classification in Human Activity Recognition (2025-05-27)
ProTAL: A Drag-and-Link Video Programming Framework for Temporal Action Localization (2025-05-23)