
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Saliency-Guided DETR for Moment Retrieval and Highlight Detection

Aleksandr Gordeev, Vladimir Dokholyan, Irina Tolstykh, Maksim Kuprashevich

2024-10-02
Tasks: Highlight Detection, Moment Retrieval, Retrieval, Natural Language Queries, Temporal Action Localization, Zero-shot Moment Retrieval, Natural Language Moment Retrieval

Abstract

Existing approaches for video moment retrieval and highlight detection are not able to align text and video features efficiently, resulting in unsatisfying performance and limited production usage. To address this, we propose a novel architecture that utilizes recent foundational video models designed for such alignment. Combined with the introduced Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach significantly enhances performance in both moment retrieval and highlight detection tasks. For even better improvement, we developed InterVid-MR, a large-scale and high-quality dataset for pretraining. Using it, our architecture achieves state-of-the-art results on the QVHighlights, Charades-STA and TACoS benchmarks. The proposed approach provides an efficient and scalable solution for both zero-shot and fine-tuning scenarios in video-language tasks.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | TACoS | R@1,IoU=0.3 | 58.1 | SG-DETR (w/ PT) |
| Video | TACoS | R@1,IoU=0.5 | 46.4 | SG-DETR (w/ PT) |
| Video | TACoS | R@1,IoU=0.7 | 33.9 | SG-DETR (w/ PT) |
| Video | TACoS | mIoU | 42.4 | SG-DETR (w/ PT) |
| Video | TACoS | R@1,IoU=0.3 | 56.71 | SG-DETR |
| Video | TACoS | R@1,IoU=0.5 | 44.7 | SG-DETR |
| Video | TACoS | R@1,IoU=0.7 | 29.9 | SG-DETR |
| Video | TACoS | mIoU | 40.9 | SG-DETR |
| Moment Retrieval | Charades-STA | R@1 IoU=0.5 | 71.1 | SG-DETR (w/ PT) |
| Moment Retrieval | Charades-STA | R@1 IoU=0.7 | 52.8 | SG-DETR (w/ PT) |
| Moment Retrieval | Charades-STA | R@1 IoU=0.5 | 70.2 | SG-DETR |
| Moment Retrieval | Charades-STA | R@1 IoU=0.7 | 49.5 | SG-DETR |
| Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 74.2 | SG-DETR (w/ PT) |
| Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 60.4 | SG-DETR (w/ PT) |
| Moment Retrieval | QVHighlights | mAP | 58.8 | SG-DETR (w/ PT) |
| Moment Retrieval | QVHighlights | mAP@0.5 | 76.2 | SG-DETR (w/ PT) |
| Moment Retrieval | QVHighlights | mAP@0.75 | 60.8 | SG-DETR (w/ PT) |
| Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 72.2 | SG-DETR |
| Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 56.6 | SG-DETR |
| Moment Retrieval | QVHighlights | mAP | 54.1 | SG-DETR |
| Moment Retrieval | QVHighlights | mAP@0.5 | 73.2 | SG-DETR |
| Moment Retrieval | QVHighlights | mAP@0.75 | 55.8 | SG-DETR |
| Moment Retrieval | QVHighlights | R1@0.5 | 63.9 | SG-DETR (ZS) |
| Moment Retrieval | QVHighlights | R1@0.7 | 49.6 | SG-DETR (ZS) |
| Moment Retrieval | QVHighlights | mAP | 48.3 | SG-DETR (ZS) |
| Moment Retrieval | QVHighlights | mAP@0.5 | 67.5 | SG-DETR (ZS) |
| Moment Retrieval | QVHighlights | mAP@0.75 | 49 | SG-DETR (ZS) |
| Highlight Detection | TvSum | mAP | 87.1 | SG-DETR |
| Highlight Detection | YouTube Highlights | mAP | 78 | SG-DETR (w/ PT) |
| Highlight Detection | YouTube Highlights | mAP | 76.7 | SG-DETR |
| Highlight Detection | QVHighlights | Hit@1 | 71 | SG-DETR (w/ PT) |
| Highlight Detection | QVHighlights | mAP | 44.7 | SG-DETR (w/ PT) |
| Highlight Detection | QVHighlights | Hit@1 | 69.13 | SG-DETR |
| Highlight Detection | QVHighlights | mAP | 43.76 | SG-DETR |
| 16k | TvSum | mAP | 87.1 | SG-DETR |
| 16k | YouTube Highlights | mAP | 78 | SG-DETR (w/ PT) |
| 16k | YouTube Highlights | mAP | 76.7 | SG-DETR |
| 16k | QVHighlights | Hit@1 | 71 | SG-DETR (w/ PT) |
| 16k | QVHighlights | mAP | 44.7 | SG-DETR (w/ PT) |
| 16k | QVHighlights | Hit@1 | 69.13 | SG-DETR |
| 16k | QVHighlights | mAP | 43.76 | SG-DETR |
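
The moment retrieval metrics listed above (R@1 at an IoU threshold, and mIoU) are based on temporal intersection-over-union between a predicted span and a ground-truth span. The following is a minimal sketch of how such metrics are commonly computed, not the evaluation code used by the paper or by any benchmark. The function names (`temporal_iou`, `recall_at_1`, `mean_iou`), the `(start, end)` span representation in seconds, and the choice of `>=` for the threshold comparison are all assumptions made for illustration; official evaluation scripts may differ in these details.

```python
from typing import Sequence, Tuple

# A moment is assumed to be a (start_sec, end_sec) pair with start <= end.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: Sequence[Span], gts: Sequence[Span],
                iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumption: a prediction counts as a hit when IoU >= threshold.
    """
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per ground truth")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(top1_preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Average top-1 temporal IoU over all queries, as a percentage."""
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per ground truth")
    if not gts:
        return 0.0
    total = sum(temporal_iou(p, g) for p, g in zip(top1_preds, gts))
    return 100.0 * total / len(gts)
```

For example, calling `recall_at_1(preds, gts, 0.5)` with one top-ranked predicted span per query gives a number comparable in form to the "R@1 IoU=0.5" rows, and `mean_iou(preds, gts)` to the "mIoU" rows. The mAP and Hit@1 rows rely on ranked lists and saliency scores and are not covered by this sketch.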

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)