Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the under-explored field of \emph{video-based conversation} by introducing Video-ChatGPT, a multimodal model that merges a video-adapted visual encoder with an LLM. The resulting model is capable of understanding and generating detailed conversations about videos. To train Video-ChatGPT, we introduce a new dataset of 100,000 video-instruction pairs acquired via a manual and semi-automated annotation pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyze their strengths and weaknesses. Code: https://github.com/mbzuai-oryx/Video-ChatGPT.
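The quantitative evaluation framework mentioned above rates model answers with an LLM judge on criteria such as correctness, detail orientation, and temporal understanding. The sketch below illustrates the general shape of such a protocol; the prompt wording, the `judge` callable, and the 1–5 clamping are illustrative assumptions, not the paper's exact template or API.

```python
# Hedged sketch of an LLM-judge scoring step: a judge model rates a prediction
# against the ground-truth answer on one criterion, on a 1-5 scale.
# `judge` stands in for any chat-completion call (stubbed here for the example).
from typing import Callable


def build_judge_prompt(criterion: str, question: str,
                       answer: str, prediction: str) -> str:
    """Compose the rating request sent to the judge LLM (illustrative wording)."""
    return (
        f"Rate the {criterion} of the predicted answer on a scale of 1 to 5.\n"
        f"Question: {question}\n"
        f"Correct answer: {answer}\n"
        f"Predicted answer: {prediction}\n"
        f"Reply with a single integer."
    )


def score_prediction(judge: Callable[[str], str], criterion: str,
                     question: str, answer: str, prediction: str) -> int:
    """Ask the judge for a rating and parse it, clamping to the valid 1-5 range."""
    reply = judge(build_judge_prompt(criterion, question, answer, prediction))
    score = int(reply.strip())
    return min(max(score, 1), 5)


# Example with a stub judge that always answers "4"
stub = lambda prompt: "4"
print(score_prediction(stub, "Correctness of Information",
                       "What is the man doing?", "Cooking pasta", "He is cooking."))
```

In practice the per-criterion scores are averaged over the evaluation set, which is what the per-metric rows in the table below report.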
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | NExT-QA (Open-ended VideoQA) | Accuracy | 54.6 | Video-ChatGPT |
| Question Answering | NExT-QA (Open-ended VideoQA) | Confidence Score | 3.2 | Video-ChatGPT |
| Question Answering | VNBench | Accuracy | 4.1 | Video-ChatGPT |
| Question Answering | MSVD-QA | Accuracy | 64.9 | Video-ChatGPT-7B |
| Question Answering | MSVD-QA | Confidence Score | 3.3 | Video-ChatGPT-7B |
| Question Answering | TGIF-QA | Accuracy | 51.4 | Video-ChatGPT-7B |
| Question Answering | TGIF-QA | Confidence Score | 3.0 | Video-ChatGPT-7B |
| Question Answering | MSRVTT-QA | Accuracy | 49.3 | Video-ChatGPT-7B |
| Question Answering | MSRVTT-QA | Confidence Score | 2.8 | Video-ChatGPT-7B |
| Question Answering | ActivityNet-QA | Accuracy | 35.2 | Video-ChatGPT |
| Question Answering | ActivityNet-QA | Confidence Score | 2.7 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Consistency | 2.37 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Contextual Understanding | 2.62 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Correctness of Information | 2.4 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Detail Orientation | 2.52 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Temporal Understanding | 1.98 | Video-ChatGPT |
| Visual Question Answering (VQA) | VideoInstruct | Mean | 2.38 | Video-ChatGPT |
| Video Question Answering | MVBench | Avg. | 32.7 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Consistency | 2.06 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Contextual Understanding | 2.46 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Correctness of Information | 2.07 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Dense Captioning | 0.89 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Detail Orientation | 2.42 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Reasoning | 3.60 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Spatial Understanding | 2.25 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Temporal Understanding | 1.39 | Video-ChatGPT |
| VCGBench-Diverse | VideoInstruct | Mean | 2.08 | Video-ChatGPT |
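The reported mean rows can be reproduced as the arithmetic mean of the five core GPT-scored criteria (correctness, detail orientation, contextual understanding, temporal understanding, consistency); a quick check against the VideoInstruct numbers in the table:

```python
# Verify the reported "Mean" row: average of the five per-criterion GPT scores
# for Video-ChatGPT on VideoInstruct, as listed in the table above.
videoinstruct_scores = {
    "Correctness of Information": 2.40,
    "Detail Orientation": 2.52,
    "Contextual Understanding": 2.62,
    "Temporal Understanding": 1.98,
    "Consistency": 2.37,
}
mean = round(sum(videoinstruct_scores.values()) / len(videoinstruct_scores), 2)
print(mean)  # 2.38, matching the reported mean
```

The same calculation over the five corresponding VCGBench-Diverse criteria (2.07, 2.42, 2.46, 1.39, 2.06) yields the reported 2.08; the dense-captioning, reasoning, and spatial scores are evidently excluded from that average.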