Hao Zhang, Aixin Sun, Wei Jing, Joey Tianyi Zhou
Given an untrimmed video and a text query, natural language video localization (NLVL) aims to locate a span in the video that semantically corresponds to the query. Existing solutions formulate NLVL either as a ranking task using a multimodal matching architecture, or as a regression task that directly regresses the target video span. In this work, we address the NLVL task with a span-based QA approach by treating the input video as a text passage. We propose a video span localizing network (VSLNet), built on top of the standard span-based QA framework, to address NLVL. The proposed VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the matching video span within a highlighted region. Through extensive experiments on three benchmark datasets, we show that the proposed VSLNet outperforms state-of-the-art methods, and that adopting a span-based QA framework is a promising direction for solving NLVL.
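To make the span-based QA formulation with query-guided highlighting concrete, below is a minimal PyTorch sketch. It is an illustration under assumed feature dimensions and layer choices, not the paper's exact architecture: a `QueryGuidedHighlighting` module scores each video clip's relevance to a pooled query and re-weights the clip features, and a `SpanPredictor` head then produces start/end logits over clips, as in span-based QA.

```python
import torch
import torch.nn as nn


class QueryGuidedHighlighting(nn.Module):
    """Sketch of the QGH idea: score each clip feature against the
    pooled query, then re-weight clip features by the highlight score.
    Layer choices and dimensions are assumptions for illustration."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(2 * dim, 1)

    def forward(self, video_feats, query_feats):
        # video_feats: (batch, n_clips, dim); query_feats: (batch, n_words, dim)
        query_vec = query_feats.mean(dim=1, keepdim=True)           # pooled query
        query_vec = query_vec.expand(-1, video_feats.size(1), -1)   # broadcast over clips
        scores = torch.sigmoid(
            self.scorer(torch.cat([video_feats, query_vec], dim=-1))
        )
        return scores.squeeze(-1), video_feats * scores             # highlight scores, re-weighted clips


class SpanPredictor(nn.Module):
    """Span-based QA head: start/end logits over the video clip axis."""

    def __init__(self, dim: int):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, feats):
        return self.start(feats).squeeze(-1), self.end(feats).squeeze(-1)


if __name__ == "__main__":
    # Toy usage with random features (2 videos, 128 clips, 256-d; 12-word queries).
    video = torch.randn(2, 128, 256)
    query = torch.randn(2, 12, 256)
    qgh = QueryGuidedHighlighting(256)
    head = SpanPredictor(256)
    h_scores, highlighted = qgh(video, query)
    start_logits, end_logits = head(highlighted)
    print(start_logits.shape, end_logits.shape)  # torch.Size([2, 128]) twice
```

The predicted span would then be the (start, end) clip pair maximizing the joint start/end scores, mirroring answer-span extraction in text QA.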
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Understanding | Ego4D-Goalstep | R@1,IoU=0.3 | 11.7 | VSLNet |
| Temporal Sentence Grounding | Ego4D-Goalstep | R@1,IoU=0.3 | 11.7 | VSLNet |