Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

WonJun Moon, Sangeek Hyun, Sanguk Park, Dongchan Park, Jae-Pil Heo

2023-03-24 · CVPR 2023
Tasks: Video Grounding, Highlight Detection, Moment Retrieval, Video Understanding, Retrieval, Natural Language Queries
Paper · PDF · Code (official)

Abstract

Recently, video moment retrieval and highlight detection (MR/HD) have been spotlighted as the demand for video understanding has increased drastically. The key objective of MR/HD is to localize moments and estimate clip-wise accordance levels, i.e., saliency scores, for a given text query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in a given query. For example, the relevance between the text query and the video contents is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Observing the insignificant role of a given query in transformer architectures, our encoding module starts with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to enhance the model's ability to exploit query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn encourages the model to estimate the precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor that adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building query-dependent representations for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
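The encoding step the abstract describes — cross-attention layers that condition each video clip on the text query — can be sketched in a few lines of NumPy. This is an illustrative single-head version with assumed shapes and names, not the authors' PyTorch implementation (which is in the linked repo):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video_feats, text_feats):
    """Inject text-query context into clip-wise video features.

    video_feats: (num_clips, d)  -- clips act as attention queries
    text_feats:  (num_tokens, d) -- query tokens act as keys and values
    Returns query-conditioned clip features of shape (num_clips, d).
    """
    d = video_feats.shape[-1]
    scores = video_feats @ text_feats.T / np.sqrt(d)   # (clips, tokens)
    attn = softmax(scores, axis=-1)                    # each row sums to 1
    return attn @ text_feats  # each clip becomes a mixture of text-token features

# Assumed toy dimensions: 75 clips, 12 query tokens, 256-d features.
rng = np.random.default_rng(0)
video = rng.normal(size=(75, 256))
text = rng.normal(size=(12, 256))
out = cross_attention(video, text)
print(out.shape)  # (75, 256)
```

In the full model this would be one of several stacked layers with learned projections and residual connections; the point here is only the direction of the attention: video clips query the text, so every clip representation depends on the given query.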

Results

Task | Dataset | Metric | Value | Model
Moment Retrieval | Charades-STA | R@1 IoU=0.5 | 57.31 | QD-DETR (only Video)
Moment Retrieval | Charades-STA | R@1 IoU=0.7 | 32.55 | QD-DETR (only Video)
Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 62.4 | QD-DETR (only Video)
Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 44.98 | QD-DETR (only Video)
Moment Retrieval | QVHighlights | mAP | 39.86 | QD-DETR (only Video)
Moment Retrieval | QVHighlights | mAP@0.5 | 62.52 | QD-DETR (only Video)
Moment Retrieval | QVHighlights | mAP@0.75 | 39.88 | QD-DETR (only Video)
Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 63.06 | QD-DETR (w/ audio)
Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 45.1 | QD-DETR (w/ audio)
Moment Retrieval | QVHighlights | mAP | 40.19 | QD-DETR (w/ audio)
Moment Retrieval | QVHighlights | mAP@0.5 | 63.04 | QD-DETR (w/ audio)
Moment Retrieval | QVHighlights | mAP@0.75 | 40.1 | QD-DETR (w/ audio)
Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 64.1 | QD-DETR (w/ PT)
Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 46.1 | QD-DETR (w/ PT)
Moment Retrieval | QVHighlights | mAP | 40.62 | QD-DETR (w/ PT)
Moment Retrieval | QVHighlights | mAP@0.5 | 64.3 | QD-DETR (w/ PT)
Moment Retrieval | QVHighlights | mAP@0.75 | 40.5 | QD-DETR (w/ PT)
Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 63.2 | QD-DETR (only Video w/ PT ASR Captions)
Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 45.2 | QD-DETR (only Video w/ PT ASR Captions)
Moment Retrieval | QVHighlights | mAP | 40.0 | QD-DETR (only Video w/ PT ASR Captions)
Moment Retrieval | QVHighlights | mAP@0.5 | 63.4 | QD-DETR (only Video w/ PT ASR Captions)
Moment Retrieval | QVHighlights | mAP@0.75 | 40.4 | QD-DETR (only Video w/ PT ASR Captions)
Highlight Detection | TVSum | mAP | 86.6 | QD-DETR
Highlight Detection | TVSum | mAP | 85.0 | QD-DETR (only Video)
Highlight Detection | QVHighlights | Hit@1 | 62.87 | QD-DETR
Highlight Detection | QVHighlights | mAP | 39.04 | QD-DETR
Highlight Detection | QVHighlights | Hit@1 | 62.4 | QD-DETR (only Video)
Highlight Detection | QVHighlights | mAP | 38.94 | QD-DETR (only Video)
Highlight Detection | QVHighlights | Hit@1 | 62.27 | QD-DETR (w/ PT)
Highlight Detection | QVHighlights | mAP | 38.52 | QD-DETR (w/ PT)
Highlight Detection | QVHighlights | Hit@1 | 61.91 | QD-DETR (only Video w/ PT)
Video Grounding | QVHighlights | R@1 IoU=0.5 | 62.4 | QD-DETR
Video Grounding | QVHighlights | R@1 IoU=0.7 | 44.98 | QD-DETR

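For reference, the R@1 IoU=τ numbers above follow the standard moment-retrieval convention: a query counts as a hit when the top-ranked predicted moment overlaps the ground-truth moment with temporal IoU at or above the threshold τ, and the metric reports the hit percentage. The official evaluation code is in the linked repo; the sketch below only illustrates the definition, and the prediction values in it are made up:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two moments, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, thresh):
    """R@1 at an IoU threshold: percentage of queries whose top-ranked
    predicted moment reaches temporal IoU >= thresh with the ground truth."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(top1_preds)

# Hypothetical top-1 predictions and ground-truth moments for two queries:
preds = [(10.0, 20.0), (0.0, 4.0)]
gts = [(12.0, 22.0), (50.0, 60.0)]
print(recall_at_1(preds, gts, 0.5))  # 50.0: only the first moment reaches IoU 0.5
print(recall_at_1(preds, gts, 0.7))  # 0.0: its IoU is 8/12 ~ 0.67, below 0.7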

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)