
Localizing Moments in Long Video Via Multimodal Guidance

Wayner Barrios, Mattia Soldan, Alberto Mario Ceballos-Arroyo, Fabian Caba Heilbron, Bernard Ghanem

2023-02-26 · ICCV 2023
Tasks: Video Grounding · Natural Language Visual Grounding · Video Understanding · Natural Language Moment Retrieval
Paper · PDF · Code (official)

Abstract

The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with a notable finding: current grounding methods alone fail in this challenging setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent, which trade off efficiency against accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% on MAD and 4.52% on Ego4D (NLQ). Code, data, and the MAD audio features necessary to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding.
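The two-stage idea in the abstract (a Guidance Model prunes non-describable windows, then a base grounding model ranks the survivors against the query) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the `keep_ratio` knob, and the scoring callables are all hypothetical placeholders standing in for the paper's Guidance Model and base grounding model.

```python
# Hedged sketch of guided grounding: prune candidate temporal windows with a
# guidance score, then rank the remaining windows against the language query.
# `guidance_score` and `grounding_score` are placeholder callables, not the
# paper's actual models; `keep_ratio` is an assumed pruning knob.

def guided_grounding(windows, query, guidance_score, grounding_score, keep_ratio=0.5):
    """Rank candidate (start, end) windows for `query` after guidance pruning.

    windows         : list of (start, end) candidate temporal windows
    guidance_score  : callable(window) -> float   (query-agnostic variant;
                      a query-dependent variant would also take `query`)
    grounding_score : callable(window, query) -> float
    keep_ratio      : fraction of windows that survive pruning (assumption)
    """
    # 1) Guidance step: keep only the most "describable" windows.
    scored = sorted(windows, key=guidance_score, reverse=True)
    kept = scored[: max(1, int(len(scored) * keep_ratio))]
    # 2) Grounding step: rank the surviving windows against the query.
    return sorted(kept, key=lambda w: grounding_score(w, query), reverse=True)
```

The efficiency argument falls out of the structure: the expensive grounding model only scores the windows the cheap guidance step keeps, which is why a query-agnostic Guidance Model (scores computable once per video, independent of the query) is the faster of the two designs described.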

Results

| Task  | Dataset | Metric        | Value | Model                            |
|-------|---------|---------------|-------|----------------------------------|
| Video | MAD     | R@1, IoU=0.1  | 9.3   | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@1, IoU=0.3  | 4.65  | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@1, IoU=0.5  | 2.16  | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@5, IoU=0.1  | 18.96 | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@5, IoU=0.3  | 13.06 | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@5, IoU=0.5  | 7.4   | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@10, IoU=0.1 | 24.3  | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@10, IoU=0.3 | 17.73 | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@10, IoU=0.5 | 11.09 | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@50, IoU=0.1 | 39.79 | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@50, IoU=0.3 | 32.23 | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@50, IoU=0.5 | 23.21 | Zero-Shot CLIP + Guidance Model  |
| Video | MAD     | R@100, IoU=0.1 | 47.35 | Zero-Shot CLIP + Guidance Model |
| Video | MAD     | R@100, IoU=0.3 | 39.58 | Zero-Shot CLIP + Guidance Model |
| Video | MAD     | R@100, IoU=0.5 | 29.68 | Zero-Shot CLIP + Guidance Model |
| Video | MAD     | R@1, IoU=0.1  | 5.6   | VLG-Net + Guidance Model         |
| Video | MAD     | R@1, IoU=0.3  | 4.28  | VLG-Net + Guidance Model         |
| Video | MAD     | R@1, IoU=0.5  | 2.48  | VLG-Net + Guidance Model         |
| Video | MAD     | R@5, IoU=0.1  | 16.07 | VLG-Net + Guidance Model         |
| Video | MAD     | R@5, IoU=0.5  | 8.78  | VLG-Net + Guidance Model         |
| Video | MAD     | R@10, IoU=0.1 | 23.64 | VLG-Net + Guidance Model         |
| Video | MAD     | R@10, IoU=0.3 | 19.86 | VLG-Net + Guidance Model         |
| Video | MAD     | R@10, IoU=0.5 | 13.72 | VLG-Net + Guidance Model         |
| Video | MAD     | R@50, IoU=0.1 | 45.35 | VLG-Net + Guidance Model         |
| Video | MAD     | R@50, IoU=0.3 | 39.77 | VLG-Net + Guidance Model         |
| Video | MAD     | R@50, IoU=0.5 | 30.22 | VLG-Net + Guidance Model         |
| Video | MAD     | R@100, IoU=0.1 | 55.59 | VLG-Net + Guidance Model        |
| Video | MAD     | R@100, IoU=0.3 | 49.38 | VLG-Net + Guidance Model        |
| Video | MAD     | R@100, IoU=0.5 | 39.12 | VLG-Net + Guidance Model        |
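The table's R@K, IoU=θ entries follow the standard moment-retrieval recall metric: a query counts as a hit if any of the model's top-K predicted windows overlaps the ground-truth moment with temporal IoU at least θ, and the reported value is the percentage of hit queries. A minimal sketch of that computation (standard definition, not code from the paper's repository; dataset-specific conventions such as tie-breaking may differ):

```python
def temporal_iou(a, b):
    """Temporal IoU of two (start, end) intervals, in [0, 1]."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truths, k, iou_thr):
    """Percentage of queries whose top-k predictions contain a window with
    IoU >= iou_thr against the ground-truth moment (the R@K,IoU metric).

    predictions   : per-query list of (start, end) windows, ranked best-first
    ground_truths : per-query ground-truth (start, end) moment
    """
    hits = sum(
        any(temporal_iou(pred, gt) >= iou_thr for pred in preds[:k])
        for preds, gt in zip(predictions, ground_truths)
    )
    return 100.0 * hits / len(ground_truths)
```

Reading the table through this lens: loosening either knob (larger K, smaller θ) can only add hits, which is why values grow monotonically from R@1, IoU=0.5 up to R@100, IoU=0.1 within each model.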

Related Papers

- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
- Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
- Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
- Omni-Video: Democratizing Unified Video Understanding and Generation (2025-07-08)
- MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding (2025-07-08)
- Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models (2025-07-08)