TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM

Ye Wang, Boshen Xu, Zihao Yue, Zihan Xiao, Ziheng Wang, Liang Zhang, Dingyi Yang, Wenxuan Wang, Qin Jin

2025-03-17 | Video Grounding

Abstract

We introduce TimeZero, a reasoning-guided LVLM designed for the temporal video grounding (TVG) task. This task requires precisely localizing relevant video segments within long videos based on a given language query. TimeZero tackles this challenge by extending the inference process, enabling the model to reason about video-language relationships solely through reinforcement learning. To evaluate the effectiveness of TimeZero, we conduct experiments on two benchmarks, where TimeZero achieves state-of-the-art performance on Charades-STA. Code is available at https://github.com/www-Ye/TimeZero.
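The abstract does not spell out the reinforcement-learning signal. The sketch below is a minimal illustration under two stated assumptions. First, the reward for a predicted segment is its temporal intersection-over-union (IoU) with the ground-truth segment; temporal IoU is the standard TVG evaluation measure, used for example in R@1 at IoU thresholds on Charades-STA. Second, rewards for several responses sampled for the same query are normalized within their group, as in GRPO-style training. `Segment`, `temporal_iou`, and `group_relative_advantages` are hypothetical names chosen for this example, not the TimeZero codebase's API.

```python
from dataclasses import dataclass
import statistics


@dataclass(frozen=True)
class Segment:
    """A video segment in seconds (hypothetical helper type for this sketch)."""
    start: float
    end: float


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1].

    Assumed here as the grounding reward; the abstract does not confirm it.
    """
    # Overlap length; zero when the intervals are disjoint.
    inter = max(0.0, min(pred.end, gt.end) - max(pred.start, gt.start))
    # Span covering both intervals. It equals the true union whenever they
    # overlap; when they are disjoint, inter is 0 so the IoU is 0 anyway.
    union = max(pred.end, gt.end) - min(pred.start, gt.start)
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Standardize each sampled response's reward against its group (GRPO-style).

    Responses better than the group mean get positive advantages, worse ones
    negative. If all rewards are equal, there is no learning signal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    gt = Segment(2.0, 8.0)
    # Four hypothetical responses sampled for one language query.
    preds = [Segment(3.0, 7.0), Segment(0.0, 10.0), Segment(9.0, 12.0), Segment(2.0, 8.0)]
    rewards = [temporal_iou(p, gt) for p in preds]
    advantages = group_relative_advantages(rewards)
    for p, r, a in zip(preds, rewards, advantages):
        print(f"{p.start:5.1f}-{p.end:5.1f}s  IoU={r:.3f}  advantage={a:+.3f}")
```

In this setup the exact match receives the largest advantage and the disjoint prediction the smallest, so a policy update would push the model toward tighter boundaries. Any format reward for the reasoning trace, or the exact normalization TimeZero uses, is outside what the abstract states and is not modeled here.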
