Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT

Yifang Xu, Yunzhuo Sun, Zien Xie, Benxiang Zhai, Sidan Du

2024-03-04 · Applied Sciences 2024
Tasks: Image Captioning · Zero-shot Moment Retrieval
Paper · PDF · Code (official)

Abstract

Video temporal grounding (VTG) aims to locate specific temporal segments in an untrimmed video based on a linguistic query. Most existing VTG models are trained on extensive annotated video-text pairs, a process that not only introduces human biases from the queries but also incurs significant computational costs. To tackle these challenges, we propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning. To reduce bias in the original query, we employ Baichuan2 to generate debiased queries. To lessen redundant information in videos, we apply MiniGPT-v2 to transform visual content into more precise captions. Finally, we devise the proposal generator and post-processing to produce accurate segments from debiased queries and image captions. Extensive experiments demonstrate that VTG-GPT significantly outperforms SOTA methods in zero-shot settings and surpasses unsupervised approaches. More notably, it achieves competitive performance comparable to supervised methods. The code is available at https://github.com/YoucanBaby/VTG-GPT
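One plausible reading of the proposal-generation step described in the abstract: per-frame captions (from MiniGPT-v2) are scored against the debiased query, and contiguous high-scoring frames are merged into temporal segment proposals. The sketch below illustrates that merging step only; the scoring model, threshold value, and function names are assumptions for illustration, not the authors' implementation.

```python
def generate_proposals(frame_scores, fps=1.0, threshold=0.5):
    """Merge runs of frames whose caption-query similarity meets
    `threshold` into (start_sec, end_sec) segment proposals."""
    proposals = []
    start = None
    for i, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = i                                   # open a new segment
        elif score < threshold and start is not None:
            proposals.append((start / fps, i / fps))    # close the segment
            start = None
    if start is not None:                               # segment runs to the end
        proposals.append((start / fps, len(frame_scores) / fps))
    return proposals

# Example: frames 2-4 score above threshold, giving one proposal at 1 fps.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1]
print(generate_proposals(scores))  # [(2.0, 5.0)]
```

In the actual method, a post-processing stage would further refine or rank such proposals before returning the final segment.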

Results

Task              Dataset       Metric    Value   Model
Moment Retrieval  QVHighlights  R1@0.5    54.26   VTG-GPT
Moment Retrieval  QVHighlights  R1@0.7    38.45   VTG-GPT
Moment Retrieval  QVHighlights  mAP       30.91   VTG-GPT
Moment Retrieval  QVHighlights  mAP@0.5   54.17   VTG-GPT
Moment Retrieval  QVHighlights  mAP@0.75  29.73   VTG-GPT
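For context on the metrics above: R1@0.5 and R1@0.7 count a query as correct when the model's top-1 predicted segment overlaps the ground-truth segment with temporal IoU at or above the threshold. A minimal temporal-IoU computation (an illustrative sketch, not the benchmark's evaluation code):

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A 6 s prediction against a 6 s ground truth shifted by 2 s:
print(temporal_iou((10.0, 16.0), (12.0, 18.0)))  # 0.5 -> counted at the 0.5 threshold
```

The mAP rows average precision over ranked proposals, either at a fixed IoU threshold (mAP@0.5, mAP@0.75) or averaged over thresholds (mAP).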

Related Papers

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
HalLoc: Token-level Localization of Hallucinations for Vision Language Models (2025-06-12)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs (2025-06-11)
A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning (2025-06-11)
Edit Flows: Flow Matching with Edit Operations (2025-06-10)
Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings (2025-06-10)