Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | Zero-shot Video Question Answering on LongVideoBench | Accuracy (%) | 61.9 | LLaVA-Video |
| Visual Question Answering (VQA) | VLM2-Bench | Average Score on VLM2-Bench (9 subtasks) | 43.32 | LLaVA-Video-7B |
| Visual Question Answering (VQA) | VLM2-Bench | GC-mat | 18.53 | LLaVA-Video-7B |
| Visual Question Answering (VQA) | VLM2-Bench | GC-trk | 12.79 | LLaVA-Video-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-cnt | 62.47 | LLaVA-Video-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-cpr | 54.72 | LLaVA-Video-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-grp | 28.5 | LLaVA-Video-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-VID | 59 | LLaVA-Video-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-cnt | 66.91 | LLaVA-Video-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-cpr | 62 | LLaVA-Video-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-grp | 25 | LLaVA-Video-7B |
| Visual Question Answering (VQA) | SQA3D | Exact Match | 48.5 | LLaVA-Video |
| Video Question Answering | TVBench | Average Accuracy | 50 | LLaVA-Video-72B |
| Video Question Answering | TVBench | Average Accuracy | 45.6 | LLaVA-Video-7B |
| Video Question Answering | NExT-QA | Accuracy | 83.2 | LLaVA-Video |
| Video Question Answering | Zero-shot Video Question Answering on LongVideoBench | Accuracy (%) | 61.9 | LLaVA-Video |
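
The VLM2-Bench average in the table can be checked against the nine per-subtask rows. The sketch below assumes the reported average is an unweighted mean of the subtask scores, which the table does not state explicitly; the dictionary name `vlm2_scores` is illustrative, and the values are copied from the rows above.

```python
from statistics import mean

# Per-subtask scores for LLaVA-Video-7B, copied from the VLM2-Bench rows.
# Assumption: the reported average is a plain (unweighted) mean of these nine.
vlm2_scores = {
    "GC-mat": 18.53,
    "GC-trk": 12.79,
    "OC-cnt": 62.47,
    "OC-cpr": 54.72,
    "OC-grp": 28.5,
    "PC-VID": 59.0,
    "PC-cnt": 66.91,
    "PC-cpr": 62.0,
    "PC-grp": 25.0,
}

average = mean(vlm2_scores.values())
print(f"{average:.2f}")  # prints 43.32, matching the table's average row
```

Under that assumption, the unweighted mean reproduces the 43.32 reported for the nine subtasks.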