Localizing Moments in Long Video Via Multimodal Guidance

Wayner Barrios, Mattia Soldan, Alberto Mario Ceballos-Arroyo, Fabian Caba Heilbron, Bernard Ghanem

2023-02-26 | ICCV 2023 | Video Grounding, Natural Language Visual Grounding, Video Understanding, Natural Language Moment Retrieval


Abstract

The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: current grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ), respectively. Code, data and MAD's audio features necessary to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding.
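The two-stage idea in the abstract (a base grounding model scores proposals inside short windows, and a Guidance Model down-weights or prunes windows that are unlikely to be describable) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `guided_rerank`, the multiplicative fusion of the two scores, and the `threshold` parameter are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# (start_sec, end_sec, score) for one candidate moment
Proposal = Tuple[float, float, float]


def guided_rerank(
    window_proposals: List[List[Proposal]],
    guidance_scores: List[float],
    threshold: float = 0.0,
) -> List[Proposal]:
    """Re-rank grounding proposals using per-window guidance scores.

    window_proposals[i] holds the proposals that a base grounding model
    produced for short temporal window i. guidance_scores[i] is an assumed
    describability score in [0, 1] for that same window.
    """
    if len(window_proposals) != len(guidance_scores):
        raise ValueError("one guidance score is required per window")

    ranked: List[Proposal] = []
    for proposals, guidance in zip(window_proposals, guidance_scores):
        if guidance < threshold:
            continue  # prune windows judged non-describable
        for start, end, grounding_score in proposals:
            # Assumed fusion rule: scale the grounding score by the guidance score.
            ranked.append((start, end, grounding_score * guidance))

    # Highest fused score first.
    ranked.sort(key=lambda p: p[2], reverse=True)
    return ranked


if __name__ == "__main__":
    windows = [[(0.0, 5.0, 0.9)], [(60.0, 65.0, 0.6)]]
    guidance = [0.1, 0.9]
    for start, end, score in guided_rerank(windows, guidance):
        print(f"{start:.1f}-{end:.1f}s fused score {score:.3f}")
```

In this toy input the first window has the higher grounding score but a low guidance score, so after fusion the proposal from the second window ranks first. Setting `threshold` above zero drops low-guidance windows entirely instead of only down-weighting them.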

Results

Task	Dataset	Metric	Value	Model
Video	MAD	R@1,IoU=0.1	9.3	Zero-Shot CLIP + Guidance Model
Video	MAD	R@1,IoU=0.3	4.65	Zero-Shot CLIP + Guidance Model
Video	MAD	R@1,IoU=0.5	2.16	Zero-Shot CLIP + Guidance Model
Video	MAD	R@10,IoU=0.1	24.3	Zero-Shot CLIP + Guidance Model
Video	MAD	R@10,IoU=0.3	17.73	Zero-Shot CLIP + Guidance Model
Video	MAD	R@10,IoU=0.5	11.09	Zero-Shot CLIP + Guidance Model
Video	MAD	R@100,IoU=0.1	47.35	Zero-Shot CLIP + Guidance Model
Video	MAD	R@100,IoU=0.3	39.58	Zero-Shot CLIP + Guidance Model
Video	MAD	R@100,IoU=0.5	29.68	Zero-Shot CLIP + Guidance Model
Video	MAD	R@5,IoU=0.1	18.96	Zero-Shot CLIP + Guidance Model
Video	MAD	R@5,IoU=0.3	13.06	Zero-Shot CLIP + Guidance Model
Video	MAD	R@5,IoU=0.5	7.4	Zero-Shot CLIP + Guidance Model
Video	MAD	R@50,IoU=0.1	39.79	Zero-Shot CLIP + Guidance Model
Video	MAD	R@50,IoU=0.3	32.23	Zero-Shot CLIP + Guidance Model
Video	MAD	R@50,IoU=0.5	23.21	Zero-Shot CLIP + Guidance Model
Video	MAD	R@1,IoU=0.1	5.6	VLG-Net + Guidance Model
Video	MAD	R@1,IoU=0.3	4.28	VLG-Net + Guidance Model
Video	MAD	R@1,IoU=0.5	2.48	VLG-Net + Guidance Model
Video	MAD	R@10,IoU=0.1	23.64	VLG-Net + Guidance Model
Video	MAD	R@10,IoU=0.3	19.86	VLG-Net + Guidance Model
Video	MAD	R@10,IoU=0.5	13.72	VLG-Net + Guidance Model
Video	MAD	R@100,IoU=0.1	55.59	VLG-Net + Guidance Model
Video	MAD	R@100,IoU=0.3	49.38	VLG-Net + Guidance Model
Video	MAD	R@100,IoU=0.5	39.12	VLG-Net + Guidance Model
Video	MAD	R@5,IoU=0.1	16.07	VLG-Net + Guidance Model
Video	MAD	R@5,IoU=0.5	8.78	VLG-Net + Guidance Model
Video	MAD	R@50,IoU=0.1	45.35	VLG-Net + Guidance Model
Video	MAD	R@50,IoU=0.3	39.77	VLG-Net + Guidance Model
Video	MAD	R@50,IoU=0.5	30.22	VLG-Net + Guidance Model
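The metrics in the table follow the usual "R@K, IoU=t" convention for moment retrieval: a query counts as a hit if any of its top-K predicted moments overlaps the ground-truth moment with temporal IoU of at least t, and the reported value is the percentage of queries that are hits. A minimal sketch of that computation is given below. The helper names `temporal_iou` and `recall_at_k` are hypothetical, and the use of `>=` at the threshold is an assumption; the official evaluation code may differ in such details.

```python
from typing import List, Sequence, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_k(
    ranked_predictions: Sequence[List[Moment]],
    ground_truths: Sequence[Moment],
    k: int,
    iou_threshold: float,
) -> float:
    """Percentage of queries with a top-k prediction reaching the IoU threshold.

    ranked_predictions[i] is the list of predicted moments for query i,
    sorted from most to least confident.
    """
    if len(ranked_predictions) != len(ground_truths):
        raise ValueError("one ground-truth moment is required per query")
    if not ground_truths:
        return 0.0

    hits = 0
    for predictions, gt in zip(ranked_predictions, ground_truths):
        # Assumed convention: IoU equal to the threshold counts as a hit.
        if any(temporal_iou(p, gt) >= iou_threshold for p in predictions[:k]):
            hits += 1
    return 100.0 * hits / len(ground_truths)


if __name__ == "__main__":
    predictions = [[(0.0, 10.0), (5.0, 15.0)], [(100.0, 110.0)]]
    ground_truths = [(5.0, 15.0), (0.0, 10.0)]
    print(recall_at_k(predictions, ground_truths, k=1, iou_threshold=0.3))
    print(recall_at_k(predictions, ground_truths, k=5, iou_threshold=0.5))
```

Because a larger K admits more candidates and a larger IoU threshold demands tighter overlap, values in the table rise with K and fall with the threshold for a given model.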
