Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu
Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data and comprehend visual details. However, existing Video LLMs can only provide coarse descriptions of entire videos, failing to capture the precise start and end time boundaries of specific events. In this paper, we address this issue by proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries. Specifically, VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding and align with human intents. Extensive experiments demonstrate that on fine-grained time-related video comprehension tasks such as Temporal Video Grounding and Dense Video Captioning, VTimeLLM significantly outperforms existing Video LLMs. Moreover, its fine-grained temporal understanding of videos further enables VTimeLLM to beat existing Video LLMs on video dialogue benchmarks, showing its superior cross-modal understanding and reasoning abilities.
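The boundary-aware three-stage strategy described above can be summarized as a small configuration sketch. The stage ordering and data sources come from the abstract; the `trains` field (which parameters each stage updates) and all identifier names are illustrative assumptions, not details confirmed by the paper.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str    # stage label (assumed naming)
    data: str    # data source, as described in the abstract
    trains: str  # which modules are updated: an assumption, not stated in the abstract


# The three stages of the boundary-aware training strategy, in order.
STAGES = [
    Stage("feature alignment", "image-text pairs", "visual projector"),
    Stage("boundary awareness", "multiple-event videos with time boundaries", "LLM adapters"),
    Stage("instruction tuning", "high-quality video-instruction data", "LLM adapters"),
]


def run_training(stages):
    # In a real pipeline each iteration would run an optimization loop;
    # here we only print the schedule.
    for i, stage in enumerate(stages, 1):
        print(f"Stage {i} ({stage.name}): update {stage.trains} using {stage.data}")


run_training(STAGES)
```

This sketch only captures the staged curriculum; it says nothing about the actual model architecture or loss functions, which the abstract does not specify.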
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Temporal Relation Extraction | Vinoground | Group Score | 5.2 | VTimeLLM |
| Temporal Relation Extraction | Vinoground | Text Score | 19.4 | VTimeLLM |
| Temporal Relation Extraction | Vinoground | Video Score | 27.0 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.47 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.4 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.78 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 3.1 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.49 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | mean | 2.85 | VTimeLLM |
| Video Question Answering | OVBench | AVG | 33.1 | VTimeLLM (7B) |
| Dense Video Captioning | ActivityNet Captions | CIDEr | 27.6 | VTimeLLM |
| Dense Video Captioning | ActivityNet Captions | SODA | 5.8 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Consistency | 2.35 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Contextual Understanding | 2.48 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Correctness of Information | 2.16 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Dense Captioning | 1.13 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Detail Orientation | 2.41 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Reasoning | 3.45 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Spatial Understanding | 2.29 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Temporal Understanding | 1.46 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | mean | 2.17 | VTimeLLM |
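The VideoInstruct mean in the table is the average of the five per-dimension GPT scores. A minimal check, using the values reported above:

```python
# Per-dimension GPT scores for VTimeLLM on VideoInstruct, copied from the table.
scores = {
    "Correctness of Information": 2.78,
    "Detail Orientation": 3.1,
    "Contextual Understanding": 3.4,
    "Temporal Understanding": 2.49,
    "Consistency": 2.47,
}

# Average over the five dimensions, rounded to two decimals.
mean = round(sum(scores.values()) / len(scores), 2)
print(mean)  # 2.85, matching the reported mean
```

The VCGBench-Diverse mean (2.17) does not equal a simple average of its eight listed sub-scores, so it is presumably computed differently by that benchmark; the table value is kept as reported.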