Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VTimeLLM: Empower LLM to Grasp Video Moments

Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu

2023-11-30 · CVPR 2024

Tasks: Video Grounding · Video Question Answering · Video Captioning · Dense Video Captioning · Temporal Relation Extraction · VCGBench-Diverse · Video-based Generative Performance Benchmarking (Correctness of Information, Detail Orientation, Contextual Understanding, Temporal Understanding, Consistency)
Paper · PDF · Code (official)

Abstract

Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details. However, existing Video LLMs can only provide a coarse description of the entire video, failing to capture the precise start and end time boundaries of specific events. In this paper, we address this issue by proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries. Specifically, VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding ability as well as to align with human intents. Extensive experiments demonstrate that in fine-grained time-related video comprehension tasks such as Temporal Video Grounding and Dense Video Captioning, VTimeLLM significantly outperforms existing Video LLMs. Moreover, its fine-grained temporal understanding of videos further enables VTimeLLM to surpass existing Video LLMs on video dialogue benchmarks, showing its superior cross-modal understanding and reasoning abilities.
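The three-stage schedule described above can be sketched as a staged training configuration. This is a minimal, hypothetical reconstruction from the abstract alone: the stage names, data sources, and choices of which modules are unfrozen per stage (e.g. a visual projector in stage 1, LoRA adapters later) are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of a boundary-aware three-stage training schedule,
# reconstructed from the abstract. Stage names, data sources, and the
# trainable-module choices are assumptions for illustration only.

STAGES = [
    {
        "name": "feature_alignment",
        "data": "image-text pairs",
        "goal": "align visual features with the LLM embedding space",
        "trainable": ["visual_projector"],  # assumed: only the projector is unfrozen
    },
    {
        "name": "boundary_perception",
        "data": "multiple-event videos with timestamped events",
        "goal": "increase awareness of event start/end time boundaries",
        "trainable": ["visual_projector", "llm_lora_adapters"],  # assumed
    },
    {
        "name": "instruction_tuning",
        "data": "high-quality video-instruction dialogues",
        "goal": "refine temporal understanding and align with human intents",
        "trainable": ["llm_lora_adapters"],  # assumed
    },
]

def trainable_modules(stage_name: str) -> list:
    """Return the modules left unfrozen in the given training stage."""
    for stage in STAGES:
        if stage["name"] == stage_name:
            return stage["trainable"]
    raise KeyError(f"unknown stage: {stage_name}")
```

The key idea the sketch captures is progressive unfreezing against progressively harder temporal supervision: cheap image-text pairs first, then videos whose events carry explicit timestamps, then instruction-following data.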

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Temporal Relation Extraction | Vinoground | Group Score | 5.2 | VTimeLLM |
| Temporal Relation Extraction | Vinoground | Text Score | 19.4 | VTimeLLM |
| Temporal Relation Extraction | Vinoground | Video Score | 27 | VTimeLLM |
| Video Question Answering | OVBench | AVG | 33.1 | VTimeLLM (7B) |
| Dense Video Captioning | ActivityNet Captions | CIDEr | 27.6 | VTimeLLM |
| Dense Video Captioning | ActivityNet Captions | SODA | 5.8 | VTimeLLM |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.78 | VTimeLLM |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 3.1 | VTimeLLM |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 3.4 | VTimeLLM |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 2.49 | VTimeLLM |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.47 | VTimeLLM |
| Video-based Generative Performance Benchmarking | VideoInstruct | mean | 2.85 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Correctness of Information | 2.16 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Detail Orientation | 2.41 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Contextual Understanding | 2.48 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Temporal Understanding | 1.46 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Consistency | 2.35 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Dense Captioning | 1.13 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Spatial Understanding | 2.29 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Reasoning | 3.45 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | mean | 2.17 | VTimeLLM |

Related Papers

- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder (2025-06-28)
- LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs (2025-06-27)
- Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
- Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
- How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? (2025-06-19)
- video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)