Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, LiMin Wang, Yu Qiao
In this paper, we make an initial attempt at developing an end-to-end chat-centric video understanding system, named VideoChat. It integrates video foundation models and large language models via a learnable neural interface, and excels at spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. The dataset emphasizes spatiotemporal reasoning and captures causal relationships, making it a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications; it could serve as a simple prototype for future research on chat-centric video understanding. Code and data are available at https://github.com/OpenGVLab/Ask-Anything
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | NExT-QA (Open-ended VideoQA) | Accuracy | 56.6 | VideoChat |
| Question Answering | NExT-QA (Open-ended VideoQA) | Confidence Score | 3.2 | VideoChat |
| Question Answering | VNBench | Accuracy | 12.4 | VideoChat2 |
| Question Answering | MSVD-QA | Accuracy | 56.3 | Video Chat-7B |
| Question Answering | MSVD-QA | Confidence Score | 2.8 | Video Chat-7B |
| Question Answering | TGIF-QA | Accuracy | 34.4 | Video Chat-7B |
| Question Answering | TGIF-QA | Confidence Score | 2.3 | Video Chat-7B |
| Question Answering | MSRVTT-QA | Accuracy | 45 | Video Chat-7B |
| Question Answering | MSRVTT-QA | Confidence Score | 2.5 | Video Chat-7B |
| Question Answering | ActivityNet-QA | Accuracy | 26.5 | Video Chat |
| Question Answering | ActivityNet-QA | Confidence Score | 2.2 | Video Chat |
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.24 | Video Chat |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 2.53 | Video Chat |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.23 | Video Chat |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.5 | Video Chat |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 1.94 | Video Chat |
| Visual Question Answering (VQA) | VideoInstruct | mean | 2.29 | Video Chat |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score (Contextual Understanding) | 2.53 | Video Chat |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score (Correctness of Information) | 2.32 | Video Chat |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score (Detail Orientation) | 2.5 | Video Chat |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score (Temporal Understanding) | 1.94 | Video Chat |
| Visual Question Answering (VQA) | VideoInstruct | gpt-score (Consistency) | 2.24 | Video Chat |
| Video Question Answering | ActivityNet-QA | Accuracy | 26.5 | Video Chat |
| Video Question Answering | ActivityNet-QA | Confidence Score | 2.2 | Video Chat |
| Video Question Answering | MVBench | Avg. | 35.5 | VideoChat |
| Video Question Answering | VNBench | Accuracy | 12.4 | VideoChat2 |
| Video Question Answering | MSVD-QA | Accuracy | 56.3 | Video Chat-7B |
| Video Question Answering | MSVD-QA | Confidence Score | 2.8 | Video Chat-7B |
| Video Question Answering | TGIF-QA | Accuracy | 34.4 | Video Chat-7B |
| Video Question Answering | TGIF-QA | Confidence Score | 2.3 | Video Chat-7B |
| Video Question Answering | MSRVTT-QA | Accuracy | 45 | Video Chat-7B |
| Video Question Answering | MSRVTT-QA | Confidence Score | 2.5 | Video Chat-7B |
| Generative Visual Question Answering | VideoInstruct | Consistency | 2.24 | Video Chat |
| Generative Visual Question Answering | VideoInstruct | Contextual Understanding | 2.53 | Video Chat |
| Generative Visual Question Answering | VideoInstruct | Correctness of Information | 2.23 | Video Chat |
| Generative Visual Question Answering | VideoInstruct | Detail Orientation | 2.5 | Video Chat |
| Generative Visual Question Answering | VideoInstruct | Temporal Understanding | 1.94 | Video Chat |
| Generative Visual Question Answering | VideoInstruct | mean | 2.29 | Video Chat |
| Generative Visual Question Answering | VideoInstruct | gpt-score (Contextual Understanding) | 2.53 | Video Chat |
| Generative Visual Question Answering | VideoInstruct | gpt-score (Correctness of Information) | 2.32 | Video Chat |
| Generative Visual Question Answering | VideoInstruct | gpt-score (Detail Orientation) | 2.5 | Video Chat |
| Generative Visual Question Answering | VideoInstruct | gpt-score (Temporal Understanding) | 1.94 | Video Chat |
| Generative Visual Question Answering | VideoInstruct | gpt-score (Consistency) | 2.24 | Video Chat |
| Video-based Generative Performance Benchmarking (Correctness of Information) | VideoInstruct | gpt-score | 2.32 | Video Chat |
| Video-based Generative Performance Benchmarking | VideoInstruct | Consistency | 2.24 | Video Chat |
| Video-based Generative Performance Benchmarking | VideoInstruct | Contextual Understanding | 2.53 | Video Chat |
| Video-based Generative Performance Benchmarking | VideoInstruct | Correctness of Information | 2.23 | Video Chat |
| Video-based Generative Performance Benchmarking | VideoInstruct | Detail Orientation | 2.5 | Video Chat |
| Video-based Generative Performance Benchmarking | VideoInstruct | Temporal Understanding | 1.94 | Video Chat |
| Video-based Generative Performance Benchmarking | VideoInstruct | mean | 2.29 | Video Chat |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score (Contextual Understanding) | 2.53 | Video Chat |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score (Correctness of Information) | 2.32 | Video Chat |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score (Detail Orientation) | 2.5 | Video Chat |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score (Temporal Understanding) | 1.94 | Video Chat |
| Video-based Generative Performance Benchmarking | VideoInstruct | gpt-score (Consistency) | 2.24 | Video Chat |
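
The `mean` rows for VideoInstruct appear to be the arithmetic mean of the five named criteria (Consistency, Contextual Understanding, Correctness of Information, Detail Orientation, Temporal Understanding). The short Python check below is only a sketch under that assumption; the dictionary and the helper `mean_score` are illustrative names, not part of the VideoChat code base.

```python
# Scores copied from the named VideoInstruct rows of the table above.
scores = {
    "Consistency": 2.24,
    "Contextual Understanding": 2.53,
    "Correctness of Information": 2.23,
    "Detail Orientation": 2.5,
    "Temporal Understanding": 1.94,
}


def mean_score(values):
    """Arithmetic mean of an iterable of numeric scores."""
    values = list(values)
    return sum(values) / len(values)


# Assumption: the table's "mean" is an unweighted average rounded to 2 decimals.
overall = round(mean_score(scores.values()), 2)
print(overall)  # 2.29
```

With 2.23 for Correctness of Information the result matches the listed mean of 2.29. Substituting the 2.32 that appears in the `gpt-score` rows would give 2.31 instead, so the two values are not interchangeable.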