Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
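The abstract's key architectural idea, the instruction-aware Query Transformer, conditions the Q-Former's learned queries on the instruction text before they cross-attend to the frozen image features. The following is a toy pure-Python sketch of that conditioning pattern only (single-head attention, no learned weights, illustrative function names) — not the real BLIP-2/InstructBLIP implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over toy lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def instruction_aware_queries(learned_queries, instruction_tokens, image_features):
    """Sketch of the instruction-aware Q-Former idea: instruction tokens are
    concatenated with the learned queries, so self-attention lets the
    instruction condition the queries; the conditioned queries then
    cross-attend to the image features to extract instruction-relevant
    visual information. (Toy illustration, not the actual architecture.)"""
    # Self-attention over [queries; instruction]: instruction conditions queries.
    joint = learned_queries + instruction_tokens
    mixed = attend(joint, joint, joint)
    conditioned = mixed[:len(learned_queries)]
    # Cross-attention: conditioned queries pool over the visual features.
    return attend(conditioned, image_features, image_features)
```

With 2 learned queries, 1 instruction token, and 3 image features (all dimension 4), the output is 2 query vectors of dimension 4 — changing the instruction token changes which visual features the queries emphasize, which is the mechanism the paper credits for its zero-shot gains.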
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 37.76 | InstructBLIP |
| Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 20.56 | InstructBLIP |
| Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 27.56 | InstructBLIP |
| Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 28.02 | InstructBLIP |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 35.8 | InstructBLIP-13B (Visual Prompt) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (human) | 35.2 | InstructBLIP-13B (Visual Prompt) |
| Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 45.03 | InstructBLIP-13B |
| Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 44.63 | InstructBLIP-7B |
| Video Question Answering | MVBench | Avg. | 32.5 | InstructBLIP |
| Instruction Following | LLaVA-Bench | Avg. score | 60.9 | InstructBLIP-7B |
| Instruction Following | LLaVA-Bench | Avg. score | 58.2 | InstructBLIP-13B |
| Long-Context Understanding | MMNeedle | 1 Image, 2×2 Stitching, Exact Accuracy | 3.8 | InstructBLIP-Flan-T5-XXL |
| Long-Context Understanding | MMNeedle | 1 Image, 4×4 Stitching, Exact Accuracy | 6.2 | InstructBLIP-Flan-T5-XXL |
| Long-Context Understanding | MMNeedle | 1 Image, 8×8 Stitching, Exact Accuracy | 2.2 | InstructBLIP-Flan-T5-XXL |
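The MMNeedle rows above evaluate a stitched-image setup: several tiles are composed into one N×N grid, and "Exact Accuracy" credits the model only when it names the exact tile containing the target. A minimal sketch of that setup, assuming equally sized tiles represented as 2-D pixel lists (function names are illustrative, not from the benchmark's codebase):

```python
def stitch(tiles, n):
    """Stitch an n*n list of equally sized tiles (2-D lists of pixels),
    given in row-major order, into one composite 2-D image."""
    tile_h = len(tiles[0])
    rows = []
    for r in range(n):                 # grid row of tiles
        for y in range(tile_h):        # pixel row within each tile
            row = []
            for c in range(n):         # grid column of tiles
                row.extend(tiles[r * n + c][y])
            rows.append(row)
    return rows

def locate_tile(x, y, tile_w, tile_h):
    """Map a pixel coordinate in the stitched image back to its (row, col)
    tile index; exact accuracy requires predicting exactly this index."""
    return (y // tile_h, x // tile_w)
```

For example, stitching four 1×1 tiles `[[1]], [[2]], [[3]], [[4]]` with `n=2` yields `[[1, 2], [3, 4]]`, and a needle at pixel `(x=1, y=0)` maps back to tile `(0, 1)`. The low scores in the table (2.2–6.2) reflect how hard this exact localization is for InstructBLIP as the grid grows.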