Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

2024-10-03

Tasks: Zero-Shot Video Question Answer, Question Answering, Instruction Following, Video Question Answering, Open-Ended Question Answering, 3D Question Answering (3D-QA), Visual Question Answering (VQA), Multiple-choice

Abstract

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

Results

Task	Dataset	Metric	Value	Model
Question Answering	Zero-shot Video Question Answering on LongVideoBench	Accuracy (%)	61.9	LLaVA-Video
Visual Question Answering (VQA)	VLM2-Bench	Average Score on VLM2-Bench (9 subtasks)	43.32	LLaVA-Video-7B
Visual Question Answering (VQA)	VLM2-Bench	GC-mat	18.53	LLaVA-Video-7B
Visual Question Answering (VQA)	VLM2-Bench	GC-trk	12.79	LLaVA-Video-7B
Visual Question Answering (VQA)	VLM2-Bench	OC-cnt	62.47	LLaVA-Video-7B
Visual Question Answering (VQA)	VLM2-Bench	OC-cpr	54.72	LLaVA-Video-7B
Visual Question Answering (VQA)	VLM2-Bench	OC-grp	28.5	LLaVA-Video-7B
Visual Question Answering (VQA)	VLM2-Bench	PC-VID	59	LLaVA-Video-7B
Visual Question Answering (VQA)	VLM2-Bench	PC-cnt	66.91	LLaVA-Video-7B
Visual Question Answering (VQA)	VLM2-Bench	PC-cpr	62	LLaVA-Video-7B
Visual Question Answering (VQA)	VLM2-Bench	PC-grp	25	LLaVA-Video-7B
Visual Question Answering (VQA)	SQA3D	Exact Match	48.5	LLaVA-Video
Video Question Answering	TVBench	Average Accuracy	50	LLaVA-Video-72B
Video Question Answering	TVBench	Average Accuracy	45.6	LLaVA-Video-7B
Video Question Answering	NExT-QA	Accuracy	83.2	LLaVA-Video
Video Question Answering	Zero-shot Video Question Answering on LongVideoBench	Accuracy (%)	61.9	LLaVA-Video
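
The VLM2-Bench average in the table appears to be the unweighted mean of the nine subtask scores listed for LLaVA-Video-7B. The short Python check below recomputes it from the table values. The equal weighting is an assumption inferred from the numbers, not something the page states, and the variable names are illustrative only.

```python
# Subtask scores for LLaVA-Video-7B, copied from the results table above.
vlm2_bench_scores = {
    "GC-mat": 18.53,
    "GC-trk": 12.79,
    "OC-cnt": 62.47,
    "OC-cpr": 54.72,
    "OC-grp": 28.5,
    "PC-VID": 59.0,
    "PC-cnt": 66.91,
    "PC-cpr": 62.0,
    "PC-grp": 25.0,
}

# Assumption: every subtask carries equal weight in the average.
mean_score = sum(vlm2_bench_scores.values()) / len(vlm2_bench_scores)

print(f"Mean over {len(vlm2_bench_scores)} subtasks: {mean_score:.2f}")
```

The result rounds to the 43.32 reported in the table, so the listed average is consistent with a simple mean over the nine subtasks.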
