Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding

WonJun Moon, Sangeek Hyun, SuBeen Lee, Jae-Pil Heo

2023-11-15
Representation Learning, Highlight Detection, Moment Retrieval, Natural Language Moment Retrieval

Abstract

Temporal grounding aims to identify specific moments or highlights in a video that correspond to textual descriptions. Typical approaches treat all video clips equally during encoding, regardless of their semantic relevance to the text query. We therefore propose the Correlation-Guided DEtection TRansformer (CG-DETR), which provides clues about query-associated video clips within the cross-modal attention. First, we design an adaptive cross-attention with dummy tokens. Dummy tokens conditioned on the text query take a portion of the attention weights, preventing irrelevant video clips from being represented by the text query. Yet not all words equally inherit the text query's correlation to video clips. We therefore further guide the cross-attention map by inferring the fine-grained correlation between video clips and words. We enable this by learning a joint embedding space for high-level concepts, i.e., at the moment and sentence level, and inferring the clip-word correlation. Lastly, we exploit moment-specific characteristics and combine them with the context of each video to form a moment-adaptive saliency detector. By exploiting the degree of text engagement in each video clip, it precisely measures the highlightness of each clip. CG-DETR achieves state-of-the-art results on various benchmarks for temporal grounding. Code is available at https://github.com/wjun0830/CGDETR.
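The idea of dummy tokens absorbing attention weight can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes single-head scaled dot-product attention in which video clips act as queries, words and dummy tokens act as keys, and only the words contribute values. The function name `cross_attention_with_dummies` and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention_with_dummies(clips, words, dummies):
    """Hypothetical sketch: clips attend over words plus dummy tokens.

    Returns (outputs, word_mass), where word_mass[i] is the total
    attention weight that clip i places on real words.
    """
    dim = len(clips[0])
    scale = 1.0 / math.sqrt(dim)
    keys = words + dummies  # dummy tokens compete with words in the softmax
    outputs, word_mass = [], []
    for clip in clips:
        weights = softmax([dot(clip, k) * scale for k in keys])
        word_weights = weights[:len(words)]
        # Assumption of this sketch: values come from words only, so the
        # weight absorbed by dummy tokens shrinks the text contribution.
        out = [sum(w * word[d] for w, word in zip(word_weights, words))
               for d in range(dim)]
        outputs.append(out)
        word_mass.append(sum(word_weights))
    return outputs, word_mass

# Toy example: clip 0 aligns with the word, clip 1 aligns with the dummy.
clips = [[1.0, 0.0], [0.0, 1.0]]
words = [[2.0, 0.0]]
dummies = [[0.0, 2.0]]
outputs, word_mass = cross_attention_with_dummies(clips, words, dummies)
```

In this toy setup the clip that matches the word keeps most of its attention on the text, while the other clip hands most of its attention to the dummy token and so receives a much weaker text representation. Without dummy tokens, the softmax would force every clip to spend all of its attention on the words.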

Results

Task	Dataset	Metric	Value	Model
Video	TACoS	R@1 IoU=0.3	52.23	CG-DETR
Video	TACoS	R@1 IoU=0.5	39.61	CG-DETR
Video	TACoS	R@1 IoU=0.7	22.23	CG-DETR
Video	TACoS	mIoU	36.48	CG-DETR
Moment Retrieval	Charades-STA	R@1 IoU=0.5	58.44	CG-DETR
Moment Retrieval	Charades-STA	R@1 IoU=0.7	36.34	CG-DETR
Moment Retrieval	QVHighlights	R@1 IoU=0.5	68.48	CG-DETR (w/ PT)
Moment Retrieval	QVHighlights	R@1 IoU=0.7	53.11	CG-DETR (w/ PT)
Moment Retrieval	QVHighlights	mAP	47.97	CG-DETR (w/ PT)
Moment Retrieval	QVHighlights	mAP@0.5	69.4	CG-DETR (w/ PT)
Moment Retrieval	QVHighlights	mAP@0.75	49.12	CG-DETR (w/ PT)
Moment Retrieval	QVHighlights	R@1 IoU=0.5	65.43	CG-DETR
Moment Retrieval	QVHighlights	R@1 IoU=0.7	48.38	CG-DETR
Moment Retrieval	QVHighlights	mAP	42.86	CG-DETR
Moment Retrieval	QVHighlights	mAP@0.5	64.51	CG-DETR
Moment Retrieval	QVHighlights	mAP@0.75	42.77	CG-DETR
Highlight Detection	TvSum	mAP	86.8	CG-DETR
Highlight Detection	YouTube Highlights	mAP	75.9	CG-DETR
Highlight Detection	QVHighlights	Hit@1	66.6	CG-DETR (w/ PT)
Highlight Detection	QVHighlights	mAP	40.71	CG-DETR (w/ PT)
Highlight Detection	QVHighlights	Hit@1	66.21	CG-DETR
Highlight Detection	QVHighlights	mAP	40.33	CG-DETR
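For readers unfamiliar with the moment retrieval metrics in the table, the following sketch shows how a recall-at-1 score at a temporal IoU threshold is commonly computed. It assumes one ground-truth moment per query and counts a hit when the IoU is at or above the threshold. Benchmark evaluation scripts may differ in these details, and the function names are illustrative.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, threshold):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumption: exactly one ground-truth moment per query.
    """
    hits = sum(1 for p, g in zip(top1_preds, gts)
               if temporal_iou(p, g) >= threshold)
    return 100.0 * hits / len(gts)

# Toy data: three queries with IoUs of 1.0, 0.6, and 0.0.
preds = [(0.0, 10.0), (2.0, 8.0), (0.0, 5.0)]
gts = [(0.0, 10.0), (0.0, 10.0), (10.0, 15.0)]
r1_at_05 = recall_at_1(preds, gts, 0.5)  # two of three queries hit
r1_at_07 = recall_at_1(preds, gts, 0.7)  # one of three queries hits
```

Raising the threshold from 0.5 to 0.7 demands tighter localization, which is why the IoU=0.7 scores in the table are consistently lower than the IoU=0.5 scores.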
