Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

Published: 2024-10-03
Tasks: Zero-Shot Video Question Answering, Question Answering, Instruction Following, Video Question Answering, Open-Ended Question Answering, 3D Question Answering (3D-QA), Visual Question Answering (VQA), Multiple-Choice
Paper · PDF

Abstract

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
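For readers who want to look at the data itself, below is a minimal sketch using the Hugging Face `datasets` library. It assumes the dataset is published on the Hub under `lmms-lab/LLaVA-Video-178K` (the hosting location and config layout are assumptions, not stated on this page), and it discovers config and field names at runtime rather than hardcoding them.

```python
# Minimal sketch: peek at the LLaVA-Video-178K release with Hugging Face `datasets`.
# Assumption: the dataset is hosted at lmms-lab/LLaVA-Video-178K; check the paper's
# release page for the authoritative location. Config and field names are discovered
# at runtime instead of being guessed.
from datasets import get_dataset_config_names, load_dataset

REPO = "lmms-lab/LLaVA-Video-178K"

# The release ships as multiple configs, so list them rather than guessing a name.
configs = get_dataset_config_names(REPO)
print(f"{len(configs)} configs, e.g.:", configs[:3])

# Stream one config to avoid downloading the full video corpus just to inspect a sample.
ds = load_dataset(REPO, configs[0], split="train", streaming=True)
sample = next(iter(ds))
print("sample fields:", sorted(sample.keys()))
```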

Results

Task | Dataset | Metric | Value | Model
Question Answering | LongVideoBench (zero-shot) | Accuracy (%) | 61.9 | LLaVA-Video
Visual Question Answering (VQA) | VLM2-Bench | Average Score (9 subtasks) | 43.32 | LLaVA-Video-7B
Visual Question Answering (VQA) | VLM2-Bench | GC-mat | 18.53 | LLaVA-Video-7B
Visual Question Answering (VQA) | VLM2-Bench | GC-trk | 12.79 | LLaVA-Video-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-cnt | 62.47 | LLaVA-Video-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-cpr | 54.72 | LLaVA-Video-7B
Visual Question Answering (VQA) | VLM2-Bench | OC-grp | 28.5 | LLaVA-Video-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-VID | 59 | LLaVA-Video-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-cnt | 66.91 | LLaVA-Video-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-cpr | 62 | LLaVA-Video-7B
Visual Question Answering (VQA) | VLM2-Bench | PC-grp | 25 | LLaVA-Video-7B
Visual Question Answering (VQA) | SQA3D | Exact Match | 48.5 | LLaVA-Video
Video Question Answering | TVBench | Average Accuracy | 50 | LLaVA-Video 72B
Video Question Answering | TVBench | Average Accuracy | 45.6 | LLaVA-Video 7B
Video Question Answering | NExT-QA | Accuracy | 83.2 | LLaVA-Video
Video Question Answering | LongVideoBench (zero-shot) | Accuracy (%) | 61.9 | LLaVA-Video
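One quick consistency check on the table: the reported VLM2-Bench average for LLaVA-Video-7B should be the unweighted mean of the nine subtask scores. A short Python check using only the numbers above confirms it:

```python
# Sanity check: the VLM2-Bench "Average Score (9 subtasks)" for LLaVA-Video-7B
# should equal the unweighted mean of the nine subtask scores listed in the table.
subtasks = {
    "GC-mat": 18.53, "GC-trk": 12.79,
    "OC-cnt": 62.47, "OC-cpr": 54.72, "OC-grp": 28.5,
    "PC-VID": 59.0, "PC-cnt": 66.91, "PC-cpr": 62.0, "PC-grp": 25.0,
}
avg = sum(subtasks.values()) / len(subtasks)
print(f"mean of {len(subtasks)} subtasks = {avg:.2f}")  # -> 43.32, matching the table
assert round(avg, 2) == 43.32
```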

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models (2025-07-17)