Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

WonJun Moon, Sangeek Hyun, Sanguk Park, Dongchan Park, Jae-Pil Heo

2023-03-24 · CVPR 2023
Tasks: Video Grounding, Highlight Detection, Moment Retrieval, Video Understanding, Retrieval, Natural Language Queries

Abstract

Video moment retrieval and highlight detection (MR/HD) have recently drawn attention as the demand for video understanding has sharply increased. The key objective of MR/HD is to localize the moment and to estimate the clip-wise accordance level, i.e., the saliency score, with respect to the given text query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information of a given query. For example, the relevance between the text query and the video contents is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Because we observe that the given query plays an insignificant role in existing transformer architectures, our encoding module starts with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to enhance the model's capability of exploiting the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn encourages the model to estimate the precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor that adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building the query-dependent representation for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
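To make two of the ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the official QD-DETR code. It assumes that video clips act as the attention queries while the words of the text query act as keys and values, and that negative pairs are formed by cyclically shifting the text queries within a batch. The function names (`cross_attend`, `make_negative_pairs`) are hypothetical, and learned projections, multiple heads, and the training loss are omitted.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D sequence."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(clips: Sequence[Vector], words: Sequence[Vector]) -> List[Vector]:
    """Single-head cross-attention without learned projections (assumption).

    Each video clip is the attention query; the text-query words are the keys
    and values, so every output clip feature is a weighted mix of word features.
    """
    dim = len(words[0])
    scale = math.sqrt(dim)
    attended = []
    for clip in clips:
        # Scaled dot-product score between this clip and every word.
        scores = [sum(c * w for c, w in zip(clip, word)) / scale for word in words]
        weights = softmax(scores)
        # Weighted sum of word features, one output dimension at a time.
        attended.append(
            [sum(wt * word[d] for wt, word in zip(weights, words)) for d in range(dim)]
        )
    return attended


def make_negative_pairs(videos: Sequence[str], queries: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each video with the text query of the next sample in the batch.

    The cyclic shift is an illustrative choice; it only guarantees a mismatch
    when the queries in the batch differ from one another.
    """
    if len(videos) != len(queries) or len(videos) < 2:
        raise ValueError("need equally sized batches with at least two samples")
    shifted = list(queries[1:]) + [queries[0]]
    return list(zip(videos, shifted))


if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.0, 1.0]]
    words = [[1.0, 0.0], [0.0, 1.0]]
    print(cross_attend(clips, words))
    print(make_negative_pairs(["v0", "v1", "v2"], ["q0", "q1", "q2"]))
```

In a training loop built on this sketch, the saliency scores predicted for the pairs returned by `make_negative_pairs` would be pushed toward low values, while the original (matching) pairs keep their annotated saliency targets.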

Results

Task	Dataset	Metric	Value	Model
Video Retrieval	QVHighlights	R@1 IoU=0.5	62.4	QD-DETR
Video Retrieval	QVHighlights	R@1 IoU=0.7	44.98	QD-DETR
Moment Retrieval	Charades-STA	R@1 IoU=0.5	57.31	QD-DETR (Only Video)
Moment Retrieval	Charades-STA	R@1 IoU=0.7	32.55	QD-DETR (Only Video)
Moment Retrieval	QVHighlights	R@1 IoU=0.5	64.1	QD-DETR (w/ PT)
Moment Retrieval	QVHighlights	R@1 IoU=0.7	46.1	QD-DETR (w/ PT)
Moment Retrieval	QVHighlights	mAP	40.62	QD-DETR (w/ PT)
Moment Retrieval	QVHighlights	mAP@0.5	64.3	QD-DETR (w/ PT)
Moment Retrieval	QVHighlights	mAP@0.75	40.5	QD-DETR (w/ PT)
Moment Retrieval	QVHighlights	R@1 IoU=0.5	63.06	QD-DETR (w/ audio)
Moment Retrieval	QVHighlights	R@1 IoU=0.7	45.1	QD-DETR (w/ audio)
Moment Retrieval	QVHighlights	mAP	40.19	QD-DETR (w/ audio)
Moment Retrieval	QVHighlights	mAP@0.5	63.04	QD-DETR (w/ audio)
Moment Retrieval	QVHighlights	mAP@0.75	40.1	QD-DETR (w/ audio)
Moment Retrieval	QVHighlights	R@1 IoU=0.5	63.2	QD-DETR (only Video w/ PT ASR Captions)
Moment Retrieval	QVHighlights	R@1 IoU=0.7	45.2	QD-DETR (only Video w/ PT ASR Captions)
Moment Retrieval	QVHighlights	mAP	40	QD-DETR (only Video w/ PT ASR Captions)
Moment Retrieval	QVHighlights	mAP@0.5	63.4	QD-DETR (only Video w/ PT ASR Captions)
Moment Retrieval	QVHighlights	mAP@0.75	40.4	QD-DETR (only Video w/ PT ASR Captions)
Moment Retrieval	QVHighlights	R@1 IoU=0.5	62.4	QD-DETR (only Video)
Moment Retrieval	QVHighlights	R@1 IoU=0.7	44.98	QD-DETR (only Video)
Moment Retrieval	QVHighlights	mAP	39.86	QD-DETR (only Video)
Moment Retrieval	QVHighlights	mAP@0.5	62.52	QD-DETR (only Video)
Moment Retrieval	QVHighlights	mAP@0.75	39.88	QD-DETR (only Video)
Highlight Detection	TVSum	mAP	86.6	QD-DETR
Highlight Detection	TVSum	mAP	85	QD-DETR (only Video)
Highlight Detection	QVHighlights	Hit@1	62.87	QD-DETR
Highlight Detection	QVHighlights	mAP	39.04	QD-DETR
Highlight Detection	QVHighlights	Hit@1	62.4	QD-DETR (only Video)
Highlight Detection	QVHighlights	mAP	38.94	QD-DETR (only Video)
Highlight Detection	QVHighlights	Hit@1	62.27	QD-DETR (w/ PT)
Highlight Detection	QVHighlights	mAP	38.52	QD-DETR (w/ PT)
Highlight Detection	QVHighlights	Hit@1	61.91	QD-DETR (only Video w/ PT)
Video Grounding	QVHighlights	R@1 IoU=0.5	62.4	QD-DETR
Video Grounding	QVHighlights	R@1 IoU=0.7	44.98	QD-DETR
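
The R@1 rows in the table report the percentage of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal IoU at or above the stated threshold (0.5 or 0.7). The sketch below shows one way to compute that metric. It assumes a single ground-truth span per query and an inclusive threshold; the official benchmark evaluation scripts may differ in both respects, and the function names are illustrative.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: Sequence[Span], gts: Sequence[Span], iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if len(top1_preds) != len(gts) or not gts:
        raise ValueError("need one top-1 prediction per ground-truth span")
    hits = sum(
        1 for pred, gt in zip(top1_preds, gts) if temporal_iou(pred, gt) >= iou_threshold
    )
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (2.0, 8.0)]
    gts = [(0.0, 10.0), (0.0, 10.0)]
    print(recall_at_1(preds, gts, 0.5))
    print(recall_at_1(preds, gts, 0.7))
```

With the two toy predictions in the usage example, the second span has an IoU of 0.6 against its ground truth, so it counts as a hit at the 0.5 threshold but not at 0.7.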
