Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

2023-04-17 · NeurIPS 2023

Tasks: Spatial Reasoning, Visual Instruction Following, Instruction Following, MMR total, Image Classification, Referring Expression Generation, Referring Expression Comprehension, Video Question Answering, 3D Question Answering (3D-QA), Visual Reasoning, 1 Image, 2*2 Stitching, Visual Question Answering, Image Retrieval
Paper · PDF · Code (official)

Abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
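The abstract's core architectural idea is connecting a vision encoder to an LLM so that both can be trained end to end. The following is a minimal NumPy sketch of that connector pattern: a learned linear projection maps vision-encoder features into the LLM's token-embedding space, and the projected visual tokens are prepended to the text tokens. All dimensions and variable names here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's real dimensions).
VISION_DIM = 1024   # output width of the (frozen) vision encoder
LLM_DIM = 4096      # token-embedding width of the language model
N_PATCHES = 256     # visual tokens produced per image
N_TEXT = 8          # text tokens in the instruction

# Trainable linear projection mapping vision features into the
# LLM embedding space -- the "connector" role described in the abstract.
W_proj = rng.normal(scale=0.02, size=(VISION_DIM, LLM_DIM))

# Stand-ins for a real encoder's patch features and the LLM's
# embeddings of the tokenized instruction text.
image_features = rng.normal(size=(N_PATCHES, VISION_DIM))
text_embeddings = rng.normal(size=(N_TEXT, LLM_DIM))

# Project visual features, then prepend them to the text sequence;
# the combined sequence is what the LLM consumes.
visual_tokens = image_features @ W_proj
llm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)

print(llm_input.shape)  # (264, 4096): 256 visual + 8 text tokens
```

In training, the projection (and optionally the LLM) would be updated on the GPT-4-generated instruction-following data while the vision encoder stays frozen; this sketch only shows the data flow, not the optimization.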

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 46.83 | LLaVA-1.5-7B
Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 43.5 | LLaVA-1-13B
Visual Question Answering (VQA) | EmbSpatial-Bench | Generation | 35.19 | LLaVA-1.6
Visual Question Answering (VQA) | ScanQA Test w/ objects | BLEU-4 | 13.5 | LL3DA
Visual Question Answering (VQA) | ScanQA Test w/ objects | CIDEr | 76.8 | LL3DA
Visual Question Answering (VQA) | ScanQA Test w/ objects | METEOR | 15.9 | LL3DA
Visual Question Answering (VQA) | ScanQA Test w/ objects | ROUGE | 37.3 | LL3DA
Video Question Answering | MVBench | Avg. | 36 | LLaVA
Image Classification | ColonINST-v1 (Seen) | Accuracy | 89.61 | LLaVA-v1 (w/ LoRA, w/ extra data)
Image Classification | ColonINST-v1 (Seen) | Accuracy | 87.86 | LLaVA-v1 (w/ LoRA, w/o extra data)
Image Classification | ColonINST-v1 (Unseen) | Accuracy | 72.08 | LLaVA-v1 (w/ LoRA, w/o extra data)
Image Classification | ColonINST-v1 (Unseen) | Accuracy | 42.17 | LLaVA-v1 (w/ LoRA, w/ extra data)
Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 68.11 | LLaVA-v1 (w/ LoRA, w/o extra data)
Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 46.85 | LLaVA-v1 (w/ LoRA, w/ extra data)
Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 86.87 | LLaVA-v1 (w/ LoRA, w/ extra data)
Referring Expression Generation | ColonINST-v1 (Seen) | Accuracy | 84.55 | LLaVA-v1 (w/ LoRA, w/o extra data)
MMR total | MRR-Benchmark | Total Column Score | 412 | LLaVA-NEXT-34B
MMR total | MRR-Benchmark | Total Column Score | 335 | LLaVA-NEXT-13B
MMR total | MRR-Benchmark | Total Column Score | 243 | LLaVA-1.5-13B

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)