BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos

Pilhyeon Lee, Hyeran Byun

2023-11-30
Temporal Sentence Grounding, Moment Retrieval, Natural Language Moment Retrieval

Abstract

Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches have achieved notable progress by predicting the center and length of a target moment. However, they suffer from center misalignment caused by the inherent ambiguity of moment centers, which leads to inaccurate predictions. To remedy this problem, we propose a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center; it only needs to predict any anchor point within the interval, from which the boundaries are directly estimated. Based on this idea, we design a boundary-aligned moment detection transformer equipped with a dual-pathway decoding process. Specifically, it refines the anchor and the boundaries in parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method that ensures proposals with high localization quality are prioritized over incomplete ones. Experiments on three benchmarks validate the effectiveness of the proposed methods. The code is available at https://github.com/Pilhyeon/BAM-DETR.
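To make the difference between the two moment formulations concrete, here is a minimal sketch in plain Python. It is not taken from the BAM-DETR code. It assumes a moment is a (start, end) interval in seconds, and it assumes "directly estimated boundaries" means regressing the distances from the anchor to the start and to the end. The names `Moment`, `from_center_length`, `offsets_for`, and `from_anchor_offsets` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Moment:
    """A temporal interval in seconds."""
    start: float
    end: float


def from_center_length(center: float, length: float) -> Moment:
    """Center-length decoding: an error in the predicted center shifts
    both boundaries by the same amount."""
    return Moment(center - length / 2.0, center + length / 2.0)


def offsets_for(anchor: float, gt: Moment) -> Tuple[float, float]:
    """Hypothetical regression targets for an anchor inside the moment:
    (distance to start, distance to end)."""
    if not gt.start <= anchor <= gt.end:
        raise ValueError("anchor must lie within the moment")
    return anchor - gt.start, gt.end - anchor


def from_anchor_offsets(anchor: float, left: float, right: float) -> Moment:
    """Boundary-oriented decoding (illustrative assumption): the boundaries
    are read off directly from the anchor, so it need not be the center."""
    if left < 0 or right < 0:
        raise ValueError("boundary distances must be non-negative")
    return Moment(anchor - left, anchor + right)


# Usage: any anchor inside [4, 12] yields targets that recover the same moment.
gt = Moment(4.0, 12.0)
for anchor in (5.0, 8.0, 10.0):
    left, right = offsets_for(anchor, gt)
    assert from_anchor_offsets(anchor, left, right) == gt
```

The loop is the point of the sketch: under center-length decoding only the exact center (8.0) reproduces the interval with the right length, whereas under the anchor-plus-offsets assumption every interior anchor has valid targets.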

Results

Task	Dataset	Metric	Value (%)	Model
Video	TACoS	R@1 IoU=0.3	56.69	BAM-DETR
Video	TACoS	R@1 IoU=0.5	41.54	BAM-DETR
Video	TACoS	R@1 IoU=0.7	26.77	BAM-DETR
Video	TACoS	mIoU	39.31	BAM-DETR
Moment Retrieval	Charades-STA	R@1 IoU=0.5	59.95	BAM-DETR
Moment Retrieval	Charades-STA	R@1 IoU=0.7	39.38	BAM-DETR
Moment Retrieval	QVHighlights	R@1 IoU=0.5	64.07	BAM-DETR (w/ audio)
Moment Retrieval	QVHighlights	R@1 IoU=0.7	48.12	BAM-DETR (w/ audio)
Moment Retrieval	QVHighlights	mAP	46.91	BAM-DETR (w/ audio)
Moment Retrieval	QVHighlights	mAP@0.5	65.61	BAM-DETR (w/ audio)
Moment Retrieval	QVHighlights	mAP@0.75	47.51	BAM-DETR (w/ audio)
Moment Retrieval	QVHighlights	R@1 IoU=0.5	63.88	BAM-DETR (w/ PT ASR Captions)
Moment Retrieval	QVHighlights	R@1 IoU=0.7	47.92	BAM-DETR (w/ PT ASR Captions)
Moment Retrieval	QVHighlights	mAP	46.67	BAM-DETR (w/ PT ASR Captions)
Moment Retrieval	QVHighlights	mAP@0.5	66.33	BAM-DETR (w/ PT ASR Captions)
Moment Retrieval	QVHighlights	mAP@0.75	48.22	BAM-DETR (w/ PT ASR Captions)
Moment Retrieval	QVHighlights	R@1 IoU=0.5	62.71	BAM-DETR
Moment Retrieval	QVHighlights	R@1 IoU=0.7	48.64	BAM-DETR
Moment Retrieval	QVHighlights	mAP	45.36	BAM-DETR
Moment Retrieval	QVHighlights	mAP@0.5	64.57	BAM-DETR
Moment Retrieval	QVHighlights	mAP@0.75	46.33	BAM-DETR
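The R@1 and mIoU columns above are commonly computed from a single top-ranked (start, end) prediction per query. The sketch below assumes the standard definitions: R@1 IoU=θ is the percentage of queries whose top-1 prediction reaches a temporal IoU of at least θ, and mIoU is the mean IoU of those predictions. It is not the official evaluation script of any benchmark, and those scripts may differ in details such as rounding. The mAP columns, which score several ranked predictions per query, are not covered. All function names are hypothetical.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1: Sequence[Span], gts: Sequence[Span], threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1, gts))
    return 100.0 * hits / len(gts)


def mean_iou(top1: Sequence[Span], gts: Sequence[Span]) -> float:
    """Average IoU of the top-1 predictions, as a percentage."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(top1, gts)) / len(gts)


# Toy data: three queries whose top-1 IoUs are 0.75, 0.5 and 0.0.
preds = [(2.0, 8.0), (10.0, 20.0), (0.0, 4.0)]
gts = [(2.0, 10.0), (10.0, 15.0), (6.0, 9.0)]
r05 = recall_at_1(preds, gts, 0.5)  # two of three queries pass
r07 = recall_at_1(preds, gts, 0.7)  # one of three queries passes
miou = mean_iou(preds, gts)
```

A higher threshold is stricter, which is why each dataset's R@1 values fall as the IoU threshold in the table rises.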
