Saliency-Guided DETR for Moment Retrieval and Highlight Detection

Aleksandr Gordeev, Vladimir Dokholyan, Irina Tolstykh, Maksim Kuprashevich

2024-10-02

Highlight Detection, Moment Retrieval, Retrieval, Natural Language Queries, Temporal Action Localization, Zero-shot Moment Retrieval, Natural Language Moment Retrieval

Abstract

Existing approaches for video moment retrieval and highlight detection are not able to align text and video features efficiently, resulting in unsatisfying performance and limited production usage. To address this, we propose a novel architecture that utilizes recent foundational video models designed for such alignment. Combined with the introduced Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach significantly enhances performance in both moment retrieval and highlight detection tasks. For even better improvement, we developed InterVid-MR, a large-scale and high-quality dataset for pretraining. Using it, our architecture achieves state-of-the-art results on the QVHighlights, Charades-STA and TACoS benchmarks. The proposed approach provides an efficient and scalable solution for both zero-shot and fine-tuning scenarios in video-language tasks.

Results

Task	Dataset	Metric	Value	Model
Moment Retrieval	TACoS	R@1 IoU=0.3	58.1	SG-DETR (w/ PT)
Moment Retrieval	TACoS	R@1 IoU=0.5	46.4	SG-DETR (w/ PT)
Moment Retrieval	TACoS	R@1 IoU=0.7	33.9	SG-DETR (w/ PT)
Moment Retrieval	TACoS	mIoU	42.4	SG-DETR (w/ PT)
Moment Retrieval	TACoS	R@1 IoU=0.3	56.71	SG-DETR
Moment Retrieval	TACoS	R@1 IoU=0.5	44.7	SG-DETR
Moment Retrieval	TACoS	R@1 IoU=0.7	29.9	SG-DETR
Moment Retrieval	TACoS	mIoU	40.9	SG-DETR
Moment Retrieval	Charades-STA	R@1 IoU=0.5	71.1	SG-DETR (w/ PT)
Moment Retrieval	Charades-STA	R@1 IoU=0.7	52.8	SG-DETR (w/ PT)
Moment Retrieval	Charades-STA	R@1 IoU=0.5	70.2	SG-DETR
Moment Retrieval	Charades-STA	R@1 IoU=0.7	49.5	SG-DETR
Moment Retrieval	QVHighlights	R@1 IoU=0.5	74.2	SG-DETR (w/ PT)
Moment Retrieval	QVHighlights	R@1 IoU=0.7	60.4	SG-DETR (w/ PT)
Moment Retrieval	QVHighlights	mAP	58.8	SG-DETR (w/ PT)
Moment Retrieval	QVHighlights	mAP@0.5	76.2	SG-DETR (w/ PT)
Moment Retrieval	QVHighlights	mAP@0.75	60.8	SG-DETR (w/ PT)
Moment Retrieval	QVHighlights	R@1 IoU=0.5	72.2	SG-DETR
Moment Retrieval	QVHighlights	R@1 IoU=0.7	56.6	SG-DETR
Moment Retrieval	QVHighlights	mAP	54.1	SG-DETR
Moment Retrieval	QVHighlights	mAP@0.5	73.2	SG-DETR
Moment Retrieval	QVHighlights	mAP@0.75	55.8	SG-DETR
Moment Retrieval	QVHighlights	R@1 IoU=0.5	63.9	SG-DETR (ZS)
Moment Retrieval	QVHighlights	R@1 IoU=0.7	49.6	SG-DETR (ZS)
Moment Retrieval	QVHighlights	mAP	48.3	SG-DETR (ZS)
Moment Retrieval	QVHighlights	mAP@0.5	67.5	SG-DETR (ZS)
Moment Retrieval	QVHighlights	mAP@0.75	49	SG-DETR (ZS)
Highlight Detection	TvSum	mAP	87.1	SG-DETR
Highlight Detection	YouTube Highlights	mAP	78	SG-DETR (w/ PT)
Highlight Detection	YouTube Highlights	mAP	76.7	SG-DETR
Highlight Detection	QVHighlights	Hit@1	71	SG-DETR (w/ PT)
Highlight Detection	QVHighlights	mAP	44.7	SG-DETR (w/ PT)
Highlight Detection	QVHighlights	Hit@1	69.13	SG-DETR
Highlight Detection	QVHighlights	mAP	43.76	SG-DETR
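
The moment retrieval metrics in the table (R@1 at an IoU threshold, and mIoU) can be illustrated with a minimal Python sketch. It assumes a single ground-truth span per query and uses the standard temporal IoU definition. The function names and the example spans are hypothetical, chosen only for illustration, and are not taken from the paper or its official code.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair. Assumption: one ground-truth span per query.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Span], gts: List[Span], iou_thr: float) -> float:
    """Percentage of queries whose top-ranked span reaches IoU >= iou_thr."""
    hits = sum(temporal_iou(p, g) >= iou_thr for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(top1_preds: List[Span], gts: List[Span]) -> float:
    """Average IoU of the top-ranked span over all queries, as a percentage."""
    total = sum(temporal_iou(p, g) for p, g in zip(top1_preds, gts))
    return 100.0 * total / len(gts)


# Hypothetical top-1 predictions and ground-truth spans for four queries.
preds = [(10.0, 20.0), (0.0, 5.0), (30.0, 50.0), (2.0, 4.0)]
gts = [(12.0, 22.0), (0.0, 10.0), (60.0, 70.0), (2.0, 4.0)]

print(recall_at_1(preds, gts, 0.5))  # 75.0 (IoUs are about 0.67, 0.5, 0.0, 1.0)
print(recall_at_1(preds, gts, 0.7))  # 25.0
print(round(mean_iou(preds, gts), 2))
```

Benchmarks that allow several ground-truth moments per query, or that report mAP over multiple IoU thresholds, need a matching step beyond this sketch.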
