Tanveer Hannan, Md Mohaiminul Islam, Thomas Seidl, Gedas Bertasius
Locating specific moments within long videos (20-120 minutes) presents a significant challenge, akin to finding a needle in a haystack. Adapting existing short-video (5-30 seconds) grounding methods to this problem yields poor performance. Since most real-life videos, such as those on YouTube and in AR/VR, are lengthy, addressing this issue is crucial. Existing methods typically operate in two stages: clip retrieval and grounding. However, this disjoint process limits the retrieval module's fine-grained event understanding, which is crucial for detecting specific moments. We propose RGNet, which deeply integrates clip retrieval and grounding into a single network capable of processing long videos at multiple levels of granularity, e.g., clips and frames. Its core component is a novel transformer encoder, RG-Encoder, that unifies the two stages through shared features and mutual optimization. The encoder incorporates a sparse attention mechanism and an attention loss to model both granularities jointly. Moreover, we introduce a contrastive clip sampling technique to closely mimic the long-video setting during training. RGNet surpasses prior methods, achieving state-of-the-art performance on the long video temporal grounding (LVTG) datasets MAD and Ego4D.
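One way to picture a sparse attention mechanism over two granularities is a mask in which frame tokens attend only within their own clip (fine-grained grounding) while clip tokens attend globally (coarse retrieval). The sketch below is a minimal illustration of that idea and is an assumption for exposition, not RGNet's actual RG-Encoder design; the token layout and function name are hypothetical.

```python
import numpy as np

def sparse_attention_mask(num_clips: int, frames_per_clip: int) -> np.ndarray:
    """Boolean attention mask over the token sequence [clip tokens | frame tokens].

    Hypothetical layout (an assumption, not the paper's exact design):
    - clip tokens may attend to every token (global, coarse retrieval);
    - frame tokens may attend only to frames of the same clip and to
      their own clip token (local, fine-grained grounding).
    True = attention allowed.
    """
    C, F = num_clips, frames_per_clip
    n = C + C * F
    mask = np.zeros((n, n), dtype=bool)
    mask[:C, :] = True                  # clip tokens: global attention
    for c in range(C):
        s = C + c * F                   # start index of clip c's frame tokens
        mask[s:s + F, s:s + F] = True   # frames attend within their own clip
        mask[s:s + F, c] = True         # ...and to their parent clip token
    return mask
```

Such a mask would typically be passed to a transformer layer as an additive or boolean attention mask, keeping the cost of frame-level attention linear in the number of clips.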
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MAD | R@1, IoU=0.1 | 12.43 | RGNet |
| Video | MAD | R@1, IoU=0.3 | 9.48 | RGNet |
| Video | MAD | R@1, IoU=0.5 | 5.61 | RGNet |
| Video | MAD | R@5, IoU=0.1 | 25.12 | RGNet |
| Video | MAD | R@5, IoU=0.3 | 18.72 | RGNet |
| Video | MAD | R@5, IoU=0.5 | 10.86 | RGNet |
| Natural Language Queries | Ego4D | R@1, IoU=0.3 | 20.63 | RGNet |
| Natural Language Queries | Ego4D | R@1, IoU=0.5 | 12.47 | RGNet |
| Natural Language Queries | Ego4D | R@1, mean of IoU=0.3 and 0.5 | 16.55 | RGNet |
| Natural Language Queries | Ego4D | R@5, IoU=0.3 | 41.67 | RGNet |
| Natural Language Queries | Ego4D | R@5, IoU=0.5 | 25.08 | RGNet |
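The R@K, IoU=θ metrics above report the fraction of queries for which at least one of the top-K predicted intervals overlaps the ground-truth moment with temporal IoU of at least θ. A minimal sketch of that computation (function names are illustrative, not from the paper's codebase):

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] time intervals (e.g., in seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truths, k, iou_threshold):
    """Fraction of queries whose top-k ranked predictions contain at least
    one interval with temporal IoU >= iou_threshold against the ground truth.

    predictions: list (one entry per query) of ranked [start, end] intervals.
    ground_truths: list of [start, end] ground-truth intervals, same order.
    """
    hits = 0
    for preds, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return hits / len(ground_truths)
```

For example, with two queries where only the second query's second-ranked prediction clears IoU=0.5, R@1 is 0.5 and R@5 is 1.0.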