Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
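The abstract's key architectural idea, the instruction-aware Query Transformer, conditions the Q-Former's learned queries on the instruction text before they cross-attend to the frozen image features. The following is a toy pure-Python sketch of that conditioning pattern only (single-head attention, no learned weights, illustrative function names) — not the real BLIP-2/InstructBLIP implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over toy lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def instruction_aware_queries(learned_queries, instruction_tokens, image_features):
    """Sketch of the instruction-aware Q-Former idea: instruction tokens are
    concatenated with the learned queries, so self-attention lets the
    instruction condition the queries; the conditioned queries then
    cross-attend to the image features to extract instruction-relevant
    visual information. (Toy illustration, not the actual architecture.)"""
    # Self-attention over [queries; instruction]: instruction conditions queries.
    joint = learned_queries + instruction_tokens
    mixed = attend(joint, joint, joint)
    conditioned = mixed[:len(learned_queries)]
    # Cross-attention: conditioned queries pool over the visual features.
    return attend(conditioned, image_features, image_features)
```

With 2 learned queries, 1 instruction token, and 3 image features (all dimension 4), the output is 2 query vectors of dimension 4 — changing the instruction token changes which visual features the queries emphasize, which is the mechanism the paper credits for its zero-shot gains.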
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 37.76 | InstructBLIP |
| Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 20.56 | InstructBLIP |
| Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 27.56 | InstructBLIP |
| Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 28.02 | InstructBLIP |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 35.8 | InstructBLIP-13B (Visual Prompt) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (human) | 35.2 | InstructBLIP-13B (Visual Prompt) |
| Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 45.03 | InstructBLIP-13B |
| Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 44.63 | InstructBLIP-7B |
| Video Question Answering | MVBench | Avg. | 32.5 | InstructBLIP |
| Instruction Following | LLaVA-Bench | Avg. score | 60.9 | InstructBLIP-7B |
| Instruction Following | LLaVA-Bench | Avg. score | 58.2 | InstructBLIP-13B |
| Long-Context Understanding | MMNeedle | 1 Image, 2×2 Stitching, Exact Accuracy | 3.8 | InstructBLIP-Flan-T5-XXL |
| Long-Context Understanding | MMNeedle | 1 Image, 4×4 Stitching, Exact Accuracy | 6.2 | InstructBLIP-Flan-T5-XXL |
| Long-Context Understanding | MMNeedle | 1 Image, 8×8 Stitching, Exact Accuracy | 2.2 | InstructBLIP-Flan-T5-XXL |
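The MMNeedle rows above evaluate a stitched-image setup: several tiles are composed into one N×N grid, and "Exact Accuracy" credits the model only when it names the exact tile containing the target. A minimal sketch of that setup, assuming equally sized tiles represented as 2-D pixel lists (function names are illustrative, not from the benchmark's codebase):

```python
def stitch(tiles, n):
    """Stitch an n*n list of equally sized tiles (2-D lists of pixels),
    given in row-major order, into one composite 2-D image."""
    tile_h = len(tiles[0])
    rows = []
    for r in range(n):                 # grid row of tiles
        for y in range(tile_h):        # pixel row within each tile
            row = []
            for c in range(n):         # grid column of tiles
                row.extend(tiles[r * n + c][y])
            rows.append(row)
    return rows

def locate_tile(x, y, tile_w, tile_h):
    """Map a pixel coordinate in the stitched image back to its (row, col)
    tile index; exact accuracy requires predicting exactly this index."""
    return (y // tile_h, x // tile_w)
```

For example, stitching four 1×1 tiles `[[1]], [[2]], [[3]], [[4]]` with `n=2` yields `[[1, 2], [3, 4]]`, and a needle at pixel `(x=1, y=0)` maps back to tile `(0, 1)`. The low scores in the table (2.2–6.2) reflect how hard this exact localization is for InstructBLIP as the grid grows.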