Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
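As a concrete illustration of the zero-shot capabilities described above, here is a minimal inference sketch using the Hugging Face `transformers` port of BLIP-2 (`Blip2ForConditionalGeneration`, available since transformers v4.27). The checkpoint name, sample image URL, and question wording are illustrative; the "Question: ... Answer:" template follows the prompt format used for zero-shot VQA in the paper.

```python
# Minimal BLIP-2 zero-shot captioning / VQA sketch (Hugging Face transformers port).
# Assumes: transformers >= 4.27, torch, Pillow, and a CUDA GPU for fp16 inference.
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda"  # fp16 weights; a GPU is assumed here
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to(device)

# Sample COCO image (URL is illustrative).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Image captioning: with no text prompt, the frozen LLM continues
# directly from the Q-Former's 32 query embeddings.
inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
ids = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(ids[0], skip_special_tokens=True).strip())

# Zero-shot VQA: the "Question: ... Answer:" prompt template from the paper.
prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)
ids = model.generate(**inputs, max_new_tokens=10)
print(processor.decode(ids[0], skip_special_tokens=True).strip())
```

The table below collects reported BLIP-2 results across benchmarks; the zero-shot entries use this style of prompting with frozen backbones.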
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | InfoSeek | Accuracy | 14.6 | BLIP-2 |
| Visual Question Answering (VQA) | OK-VQA | Accuracy | 45.9 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) |
| Visual Question Answering (VQA) | OK-VQA | Accuracy | 40.7 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Visual Question Answering (VQA) | OK-VQA | Accuracy | 39.4 | BLIP-2 ViT-L FlanT5 XL (zero-shot) |
| Visual Question Answering (VQA) | OK-VQA | Accuracy | 36.4 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Visual Question Answering (VQA) | OK-VQA | Accuracy | 31.7 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Visual Question Answering (VQA) | OK-VQA | Accuracy | 30.2 | BLIP-2 ViT-L OPT 2.7B (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 val | Accuracy | 65.2 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 val | Accuracy | 63.1 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 val | Accuracy | 62.6 | BLIP-2 ViT-L FlanT5 XL (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 val | Accuracy | 54.3 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 val | Accuracy | 53.5 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 val | Accuracy | 50.1 | BLIP-2 ViT-L OPT 2.7B (zero-shot) |
| Visual Question Answering (VQA) | PMC-VQA | Accuracy | 24.3 | BLIP-2 |
| Visual Question Answering (VQA) | InfiMM-Eval | Abductive reasoning score | 18.96 | BLIP-2 OPT 2.7B |
| Visual Question Answering (VQA) | InfiMM-Eval | Analogical reasoning score | 7.5 | BLIP-2 OPT 2.7B |
| Visual Question Answering (VQA) | InfiMM-Eval | Deductive reasoning score | 2.76 | BLIP-2 OPT 2.7B |
| Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 19.31 | BLIP-2 OPT 2.7B |
| Visual Question Answering (VQA) | GQA test-dev | Accuracy | 44.7 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) |
| Visual Question Answering (VQA) | GQA test-dev | Accuracy | 44.4 | BLIP-2 ViT-L FlanT5 XL (zero-shot) |
| Visual Question Answering (VQA) | GQA test-dev | Accuracy | 44.2 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Visual Question Answering (VQA) | GQA test-dev | Accuracy | 36.4 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Visual Question Answering (VQA) | GQA test-dev | Accuracy | 34.6 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Visual Question Answering (VQA) | GQA test-dev | Accuracy | 33.9 | BLIP-2 ViT-L OPT 2.7B (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 65 | BLIP-2 ViT-G FlanT5 XXL (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 63 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 62.3 | BLIP-2 ViT-L FlanT5 XL (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 52.6 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 52.3 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 49.7 | BLIP-2 ViT-L OPT 2.7B (zero-shot) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 82.3 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 81.74 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 81.66 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) |
| Visual Question Answering (VQA) | VQA v2 val | Accuracy | 82.19 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) |
| Visual Question Answering (VQA) | VQA v2 val | Accuracy | 81.59 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) |
| Visual Question Answering (VQA) | VQA v2 val | Accuracy | 81.55 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) |
| Visual Question Answering (VQA) | PMC-VQA | BLEU-1 | 7.6 | BLIP-2 |
| Image Captioning | nocaps-val-out-domain | CIDEr | 124.8 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Image Captioning | nocaps-val-out-domain | SPICE | 15.1 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Image Captioning | nocaps-val-out-domain | CIDEr | 124.4 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Image Captioning | nocaps-val-out-domain | SPICE | 14.8 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Image Captioning | nocaps-val-out-domain | CIDEr | 123.4 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Image Captioning | nocaps-val-out-domain | SPICE | 15.1 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Image Captioning | nocaps-val-near-domain | CIDEr | 120.2 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Image Captioning | nocaps-val-near-domain | SPICE | 15.9 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Image Captioning | nocaps-val-near-domain | CIDEr | 119.2 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Image Captioning | nocaps-val-near-domain | SPICE | 15.3 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Image Captioning | nocaps-val-near-domain | CIDEr | 117.8 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Image Captioning | nocaps-val-near-domain | SPICE | 15.4 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Image Captioning | COCO Captions | BLEU-4 | 43.7 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) |
| Image Captioning | COCO Captions | CIDEr | 145.8 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) |
| Image Captioning | COCO Captions | BLEU-4 | 43.5 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) |
| Image Captioning | COCO Captions | CIDEr | 145.2 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) |
| Image Captioning | COCO Captions | BLEU-4 | 42.4 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) |
| Image Captioning | COCO Captions | CIDEr | 144.5 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) |
| Image Captioning | nocaps-val-overall | CIDEr | 121.6 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Image Captioning | nocaps-val-overall | SPICE | 15.8 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Image Captioning | nocaps-val-overall | CIDEr | 121 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Image Captioning | nocaps-val-overall | SPICE | 15.3 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Image Captioning | nocaps-val-overall | CIDEr | 119.7 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Image Captioning | nocaps-val-overall | SPICE | 15.4 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Image Captioning | nocaps-val-in-domain | CIDEr | 123.7 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Image Captioning | nocaps-val-in-domain | SPICE | 16.3 | BLIP-2 ViT-G FlanT5 XL (zero-shot) |
| Image Captioning | nocaps-val-in-domain | CIDEr | 123.7 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Image Captioning | nocaps-val-in-domain | SPICE | 15.8 | BLIP-2 ViT-G OPT 6.7B (zero-shot) |
| Image Captioning | nocaps-val-in-domain | CIDEr | 123 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Image Captioning | nocaps-val-in-domain | SPICE | 15.8 | BLIP-2 ViT-G OPT 2.7B (zero-shot) |
| Image Retrieval | Flickr30k | Recall@1 | 89.7 | BLIP-2 ViT-G (zero-shot, 1K test set) |
| Image Retrieval | Flickr30k | Recall@10 | 98.9 | BLIP-2 ViT-G (zero-shot, 1K test set) |
| Image Retrieval | Flickr30k | Recall@5 | 98.1 | BLIP-2 ViT-G (zero-shot, 1K test set) |
| Image Retrieval | Flickr30k | Recall@1 | 88.6 | BLIP-2 ViT-L (zero-shot, 1K test set) |
| Image Retrieval | Flickr30k | Recall@10 | 98.9 | BLIP-2 ViT-L (zero-shot, 1K test set) |
| Image Retrieval | Flickr30k | Recall@5 | 97.6 | BLIP-2 ViT-L (zero-shot, 1K test set) |
| Image Retrieval | COCO (Common Objects in Context) | Recall@10 | 92.6 | BLIP-2 ViT-G (fine-tuned) |
| Image Retrieval | COCO (Common Objects in Context) | Recall@1 | 68.3 | BLIP-2 ViT-G (fine-tuned) |
| Image Retrieval | COCO (Common Objects in Context) | Recall@5 | 87.7 | BLIP-2 ViT-G (fine-tuned) |
| Image Retrieval | COCO (Common Objects in Context) | Recall@10 | 91.8 | BLIP-2 ViT-L (fine-tuned) |
| Image Retrieval | COCO (Common Objects in Context) | Recall@1 | 66.3 | BLIP-2 ViT-L (fine-tuned) |
| Image Retrieval | COCO (Common Objects in Context) | Recall@5 | 86.5 | BLIP-2 ViT-L (fine-tuned) |
| Instruction Following | LLaVA-Bench | Average score | 38.1 | BLIP-2 |
| Open Vocabulary Attribute Detection | OVAD-Box benchmark | mean average precision | 25.5 | BLIP-2 (pretrained) |
| Image-to-Text Retrieval | Flickr30k | Recall@1 | 97.6 | BLIP-2 ViT-G (zero-shot, 1K test set) |
| Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | BLIP-2 ViT-G (zero-shot, 1K test set) |
| Image-to-Text Retrieval | Flickr30k | Recall@5 | 100 | BLIP-2 ViT-G (zero-shot, 1K test set) |
| Image-to-Text Retrieval | Flickr30k | Recall@1 | 96.9 | BLIP-2 ViT-L (zero-shot, 1K test set) |
| Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | BLIP-2 ViT-L (zero-shot, 1K test set) |
| Image-to-Text Retrieval | Flickr30k | Recall@5 | 100 | BLIP-2 ViT-L (zero-shot, 1K test set) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@1 | 85.4 | BLIP-2 ViT-G (fine-tuned) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@10 | 98.5 | BLIP-2 ViT-G (fine-tuned) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@5 | 97 | BLIP-2 ViT-G (fine-tuned) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@1 | 83.5 | BLIP-2 ViT-L (fine-tuned) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@10 | 98 | BLIP-2 ViT-L (fine-tuned) |
| Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@5 | 96 | BLIP-2 ViT-L (fine-tuned) |
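The Flickr30k and COCO retrieval numbers above come from the stage-1 model (frozen image encoder plus Q-Former, no LLM): per the paper, candidates are first ranked by image-text contrastive (ITC) similarity and the top k = 128 are re-ranked with the image-text matching (ITM) head. Below is a hedged sketch of scoring a single image-text pair; it assumes a recent transformers release (roughly v4.45+) that ships `Blip2ForImageTextRetrieval` and the `Salesforce/blip2-itm-vit-g` checkpoint, and exact output fields may differ across versions.

```python
# Sketch: image-text scoring with the stage-1 BLIP-2 retrieval model.
# Assumes: transformers >= 4.45 (Blip2ForImageTextRetrieval) and a CUDA GPU.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForImageTextRetrieval

device = "cuda"
processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
model = Blip2ForImageTextRetrieval.from_pretrained(
    "Salesforce/blip2-itm-vit-g", torch_dtype=torch.float16
).to(device)

# Sample image and candidate caption (both illustrative).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "two cats lying on a couch"
inputs = processor(images=image, text=text, return_tensors="pt").to(device, torch.float16)

# ITM head: a 2-way classifier over (no match, match) for the fused pair.
itm_logits = model(**inputs, use_image_text_matching_head=True).logits_per_image
match_prob = itm_logits.softmax(dim=1)[0, 1].item()

# ITC score: similarity between the unimodal image queries and the text
# embedding; with a single pair this is a 1x1 similarity matrix.
itc_score = model(**inputs).logits_per_image[0, 0].item()

print(f"ITM match probability: {match_prob:.3f}  ITC similarity: {itc_score:.3f}")
```

In a retrieval loop, the cheap ITC score would be computed for every gallery item and the more expensive ITM forward pass reserved for the shortlist, mirroring the two-stage ranking used in the paper.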