
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi

Published 2023-05-11 · NeurIPS 2023
Tasks: Visual Instruction Following · Long-Context Understanding · Video Question Answering · Visual Question Answering (VQA) · Image Retrieval

Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
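The abstract's central architectural idea, the instruction-aware Query Transformer, can be summarized in a few lines: learnable query tokens and the embedded instruction interact through self-attention, and only the query tokens cross-attend to the frozen image encoder's features, so the extracted visual features are conditioned on the instruction. The sketch below illustrates this single mechanism in PyTorch; the dimensions, module names, and one-block structure are illustrative assumptions, not the BLIP-2/LAVIS implementation (which uses a multi-layer BERT-style Q-Former with cross-attention on alternating layers).

```python
# Illustrative sketch of an "instruction-aware" Q-Former block.
# All sizes and names here are assumptions for exposition only.
import torch
import torch.nn as nn


class InstructionAwareQFormerBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1, self.norm2, self.norm3 = (
            nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        )

    def forward(self, queries, instr_embeds, image_feats):
        # Self-attention over [queries; instruction]: the instruction tokens
        # condition which visual information the queries will extract.
        x = torch.cat([queries, instr_embeds], dim=1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Only the query tokens cross-attend to the frozen image features.
        q = x[:, : queries.size(1)]
        q = q + self.cross_attn(self.norm2(q), image_feats, image_feats)[0]
        q = q + self.ffn(self.norm3(q))
        return q  # instruction-conditioned visual features


if __name__ == "__main__":
    # Toy forward pass: 32 query tokens, a 16-token instruction,
    # 257 patch features from a frozen image encoder.
    B, dim = 2, 768
    block = InstructionAwareQFormerBlock(dim)
    queries = nn.Parameter(torch.randn(1, 32, dim)).expand(B, -1, -1)
    instr = torch.randn(B, 16, dim)
    img = torch.randn(B, 257, dim)
    print(block(queries, instr, img).shape)  # torch.Size([2, 32, 768])
```

In the paper, the Q-Former's output queries are projected and prepended to the frozen LLM's input, so during instruction tuning only the Q-Former (and the projection) are updated while the image encoder and LLM stay frozen.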

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 37.76 | InstructBLIP |
| Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 20.56 | InstructBLIP |
| Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 27.56 | InstructBLIP |
| Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 28.02 | InstructBLIP |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 35.8 | InstructBLIP-13B (Visual Prompt) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (human) | 35.2 | InstructBLIP-13B (Visual Prompt) |
| Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 45.03 | InstructBLIP-13B |
| Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 44.63 | InstructBLIP-7B |
| Video Question Answering | MVBench | Avg. | 32.5 | InstructBLIP |
| Instruction Following | LLaVA-Bench | Avg. score | 60.9 | InstructBLIP-7B |
| Instruction Following | LLaVA-Bench | Avg. score | 58.2 | InstructBLIP-13B |
| Long-Context Understanding | MMNeedle | 1 Image, 2*2 Stitching, Exact Accuracy | 3.8 | InstructBLIP-Flan-T5-XXL |
| Long-Context Understanding | MMNeedle | 1 Image, 4*4 Stitching, Exact Accuracy | 6.2 | InstructBLIP-Flan-T5-XXL |
| Long-Context Understanding | MMNeedle | 1 Image, 8*8 Stitching, Exact Accuracy | 2.2 | InstructBLIP-Flan-T5-XXL |
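To reproduce zero-shot generations like those above, the released checkpoints can be run through the LAVIS repository linked in the abstract; a HuggingFace Transformers port also exists. Below is a minimal inference sketch assuming that port and its Salesforce/instructblip-vicuna-7b checkpoint; the image URL is a placeholder, not a link from this page.

```python
# Minimal zero-shot inference sketch for InstructBLIP, assuming the
# HuggingFace Transformers port (the paper's own code lives in LAVIS).
import requests
import torch
from PIL import Image
from transformers import InstructBlipForConditionalGeneration, InstructBlipProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b"
).to(device)

# Placeholder image URL; substitute any RGB image.
url = "https://example.com/demo.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = "What is unusual about this image?"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, do_sample=False, num_beams=5, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```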

Related Papers

- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models (2025-07-13)
- RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features (2025-07-11)
- Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)