Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

Published: 2024-08-06

Tasks: Zero-Shot Video Question Answer, Transfer Learning, Video Question Answering, 3D Question Answering (3D-QA), Video Understanding, Visual Question Answering (VQA), Temporal Relation Extraction, Multiple-choice, Visual Question Answering

Links: Paper · PDF · Code · Code (official)

Abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Relation Extraction | Vinoground | Group Score | 21.8 | LLaVA-OneVision-Qwen2-72B |
| Relation Extraction | Vinoground | Text Score | 48.4 | LLaVA-OneVision-Qwen2-72B |
| Relation Extraction | Vinoground | Video Score | 35.2 | LLaVA-OneVision-Qwen2-72B |
| Relation Extraction | Vinoground | Group Score | 14.6 | LLaVA-OneVision-Qwen2-7B |
| Relation Extraction | Vinoground | Text Score | 41.6 | LLaVA-OneVision-Qwen2-7B |
| Relation Extraction | Vinoground | Video Score | 29.4 | LLaVA-OneVision-Qwen2-7B |
| Question Answering | VNBench | Accuracy | 58.7 | LLaVA-OneVision-72B |
| Question Answering | VNBench | Accuracy | 51.8 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | Average Score on VLM2-bench (9 subtasks) | 39.35 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | GC-mat | 16.6 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | GC-trk | 13.7 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-cnt | 56.17 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-cpr | 47.22 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-grp | 27.5 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-VID | 47.25 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-cnt | 46.67 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-cpr | 62 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-grp | 37 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 63.7 | LLaVA-OneVision-72B |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 57.5 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 29.1 | LLaVA-OneVision-0.5B |
| Visual Question Answering (VQA) | V*bench | Accuracy | 74.46 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | SQA3D | Exact Match | 34.2 | LLaVA-NeXT-Video |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | BLEU-4 | 9.8 | LLaVA-NeXT-Video |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | CIDEr | 46.2 | LLaVA-NeXT-Video |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | Exact Match | 18.7 | LLaVA-NeXT-Video |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | METEOR | 9.1 | LLaVA-NeXT-Video |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | ROUGE | 27.8 | LLaVA-NeXT-Video |
| Video Question Answering | OVBench | AVG | 49.5 | LLaVA-OneVision (7B) |
| Video Question Answering | NExT-QA | Accuracy | 80.2 | LLaVA-OV (72B) |
| Video Question Answering | NExT-QA | Accuracy | 79.4 | LLaVA-OV (7B) |
| Video Question Answering | VNBench | Accuracy | 58.7 | LLaVA-OneVision-72B |
| Video Question Answering | VNBench | Accuracy | 51.8 | LLaVA-OneVision-7B |
| Temporal Relation Extraction | Vinoground | Group Score | 21.8 | LLaVA-OneVision-Qwen2-72B |
| Temporal Relation Extraction | Vinoground | Text Score | 48.4 | LLaVA-OneVision-Qwen2-72B |
| Temporal Relation Extraction | Vinoground | Video Score | 35.2 | LLaVA-OneVision-Qwen2-72B |
| Temporal Relation Extraction | Vinoground | Group Score | 14.6 | LLaVA-OneVision-Qwen2-7B |
| Temporal Relation Extraction | Vinoground | Text Score | 41.6 | LLaVA-OneVision-Qwen2-7B |
| Temporal Relation Extraction | Vinoground | Video Score | 29.4 | LLaVA-OneVision-Qwen2-7B |
| Visual Question Answering | MM-Vet | GPT-4 score | 63.7 | LLaVA-OneVision-72B |
| Visual Question Answering | MM-Vet | GPT-4 score | 57.5 | LLaVA-OneVision-7B |
| Visual Question Answering | MM-Vet | GPT-4 score | 29.1 | LLaVA-OneVision-0.5B |
| Visual Question Answering | V*bench | Accuracy | 74.46 | LLaVA-OneVision-7B |
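The VLM2-Bench "Average Score (9 subtasks)" entry can be sanity-checked against the nine per-subtask scores listed for LLaVA-OneVision-7B. A quick check (subtask names and values copied from the table; an unweighted mean rounded to two decimals is assumed):

```python
# VLM2-Bench subtask scores for LLaVA-OneVision-7B, as listed in the results table.
subtasks = {
    "GC-mat": 16.6, "GC-trk": 13.7,
    "OC-cnt": 56.17, "OC-cpr": 47.22, "OC-grp": 27.5,
    "PC-VID": 47.25, "PC-cnt": 46.67, "PC-cpr": 62.0, "PC-grp": 37.0,
}

# Unweighted mean over the 9 subtasks, rounded to two decimals.
average = round(sum(subtasks.values()) / len(subtasks), 2)
print(average)
```

This reproduces the reported average of 39.35, so the summary row is consistent with the per-subtask rows.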

Related Papers

- RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
- Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models (2025-07-17)
- Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows (2025-07-16)
- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)