Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu
Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data and comprehend visual details. However, existing Video LLMs can only provide coarse descriptions of entire videos, failing to capture the precise start and end time boundaries of specific events. In this paper, we address this issue by proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries. Specifically, VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding and align with human intents. Extensive experiments demonstrate that on fine-grained time-related video comprehension tasks such as Temporal Video Grounding and Dense Video Captioning, VTimeLLM significantly outperforms existing Video LLMs. Moreover, its fine-grained temporal understanding of videos further enables VTimeLLM to beat existing Video LLMs on video dialogue benchmarks, showing its superior cross-modal understanding and reasoning abilities.
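The boundary-aware three-stage strategy described above can be summarized as a small configuration sketch. The stage ordering and data sources come from the abstract; the `trains` field (which parameters each stage updates) and all identifier names are illustrative assumptions, not details confirmed by the paper.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str    # stage label (assumed naming)
    data: str    # data source, as described in the abstract
    trains: str  # which modules are updated: an assumption, not stated in the abstract


# The three stages of the boundary-aware training strategy, in order.
STAGES = [
    Stage("feature alignment", "image-text pairs", "visual projector"),
    Stage("boundary awareness", "multiple-event videos with time boundaries", "LLM adapters"),
    Stage("instruction tuning", "high-quality video-instruction data", "LLM adapters"),
]


def run_training(stages):
    # In a real pipeline each iteration would run an optimization loop;
    # here we only print the schedule.
    for i, stage in enumerate(stages, 1):
        print(f"Stage {i} ({stage.name}): update {stage.trains} using {stage.data}")


run_training(STAGES)
```

This sketch only captures the staged curriculum; it says nothing about the actual model architecture or loss functions, which the abstract does not specify.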
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Temporal Relation Extraction | Vinoground | Group Score | 5.2 | VTimeLLM |
| Temporal Relation Extraction | Vinoground | Text Score | 19.4 | VTimeLLM |
| Temporal Relation Extraction | Vinoground | Video Score | 27.0 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.47 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 3.4 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.78 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 3.1 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 2.49 | VTimeLLM |
| Visual Question Answering (VQA) | VideoInstruct | mean | 2.85 | VTimeLLM |
| Video Question Answering | OVBench | AVG | 33.1 | VTimeLLM (7B) |
| Dense Video Captioning | ActivityNet Captions | CIDEr | 27.6 | VTimeLLM |
| Dense Video Captioning | ActivityNet Captions | SODA | 5.8 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Consistency | 2.35 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Contextual Understanding | 2.48 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Correctness of Information | 2.16 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Dense Captioning | 1.13 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Detail Orientation | 2.41 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Reasoning | 3.45 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Spatial Understanding | 2.29 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | Temporal Understanding | 1.46 | VTimeLLM |
| VCGBench-Diverse | VideoInstruct | mean | 2.17 | VTimeLLM |
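The VideoInstruct mean in the table is the average of the five per-dimension GPT scores. A minimal check, using the values reported above:

```python
# Per-dimension GPT scores for VTimeLLM on VideoInstruct, copied from the table.
scores = {
    "Correctness of Information": 2.78,
    "Detail Orientation": 3.1,
    "Contextual Understanding": 3.4,
    "Temporal Understanding": 2.49,
    "Consistency": 2.47,
}

# Average over the five dimensions, rounded to two decimals.
mean = round(sum(scores.values()) / len(scores), 2)
print(mean)  # 2.85, matching the reported mean
```

The VCGBench-Diverse mean (2.17) does not equal a simple average of its eight listed sub-scores, so it is presumably computed differently by that benchmark; the table value is kept as reported.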