
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi

Published: 2023-01-30 (Conference, 2023)

Tasks: Visual Instruction Following, Text Generation, Generative Visual Question Answering, Open Vocabulary Attribute Detection, Representation Learning, Image-to-Text, Zero-shot Text-to-Image Retrieval, Image Captioning, Visual Reasoning, Image-to-Text Retrieval, Visual Question Answering (VQA), Medical Visual Question Answering, Language Modelling, Multiple-choice, Visual Question Answering, Image Retrieval

Links: Paper · PDF · Code (official) · additional community implementations

Abstract

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
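The abstract's description of the Querying Transformer (Q-Former) is compact: a small set of learned query tokens reads from a frozen image encoder and hands a fixed-length result to a frozen language model, so only the bridge is trained. The PyTorch sketch below is a minimal illustration of that idea, not the paper's implementation; the class name, layer count, and dimensions (32 queries, a 768-dim Q-Former, a 2560-dim LLM embedding space) are assumptions chosen for readability, and the real Q-Former is a BERT-based transformer trained with the paper's two-stage objectives in the official code release.

```python
# Illustrative sketch only (assumed names and sizes), not BLIP-2's actual Q-Former.
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Learned queries that read frozen image features and emit soft prompts for a frozen LLM."""

    def __init__(self, num_queries=32, dim=768, llm_dim=2560, num_layers=2, num_heads=12):
        super().__init__()
        # Learnable query tokens (the paper uses 32 queries of width 768).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Cross-attention lets the queries pull information out of the frozen image features.
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)
        )
        # Self-attention/FFN blocks let the queries exchange information with each other.
        self.self_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, num_heads, dim_feedforward=dim * 4, batch_first=True)
            for _ in range(num_layers)
        )
        # Final projection into the frozen LLM's input-embedding space.
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, frozen_image_feats: torch.Tensor) -> torch.Tensor:
        # frozen_image_feats: (batch, num_patches, dim), produced by a frozen image encoder.
        batch = frozen_image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        for cross, block in zip(self.cross_attn, self.self_blocks):
            q = q + cross(q, frozen_image_feats, frozen_image_feats, need_weights=False)[0]
            q = block(q)
        # (batch, num_queries, llm_dim): a fixed-length visual prefix for the frozen LLM.
        return self.to_llm(q)

# Example: 1 image, 257 frozen ViT tokens of width 768 -> 32 soft prompts of width 2560.
bridge = QFormerSketch()
fake_vit_features = torch.randn(1, 257, 768)
print(bridge(fake_vit_features).shape)  # torch.Size([1, 32, 2560])
```

Only the query tokens, the small transformer, and the final projection carry gradients here; the image encoder and the language model stay frozen, which is where the paper's large reduction in trainable parameters comes from.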

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | InfoSeek | Accuracy | 14.6 | BLIP-2
Visual Question Answering (VQA) | OK-VQA | Accuracy | 45.9 | BLIP-2 ViT-G FlanT5 XXL (zero-shot)
Visual Question Answering (VQA) | OK-VQA | Accuracy | 40.7 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Visual Question Answering (VQA) | OK-VQA | Accuracy | 39.4 | BLIP-2 ViT-L FlanT5 XL (zero-shot)
Visual Question Answering (VQA) | OK-VQA | Accuracy | 36.4 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Visual Question Answering (VQA) | OK-VQA | Accuracy | 31.7 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Visual Question Answering (VQA) | OK-VQA | Accuracy | 30.2 | BLIP-2 ViT-L OPT 2.7B (zero-shot)
Visual Question Answering (VQA) | VQA v2 val | Accuracy | 82.19 | BLIP-2 ViT-G OPT 6.7B (fine-tuned)
Visual Question Answering (VQA) | VQA v2 val | Accuracy | 81.59 | BLIP-2 ViT-G OPT 2.7B (fine-tuned)
Visual Question Answering (VQA) | VQA v2 val | Accuracy | 81.55 | BLIP-2 ViT-G FlanT5 XL (fine-tuned)
Visual Question Answering (VQA) | VQA v2 val | Accuracy | 65.2 | BLIP-2 ViT-G FlanT5 XXL (zero-shot)
Visual Question Answering (VQA) | VQA v2 val | Accuracy | 63.1 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Visual Question Answering (VQA) | VQA v2 val | Accuracy | 62.6 | BLIP-2 ViT-L FlanT5 XL (zero-shot)
Visual Question Answering (VQA) | VQA v2 val | Accuracy | 54.3 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Visual Question Answering (VQA) | VQA v2 val | Accuracy | 53.5 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Visual Question Answering (VQA) | VQA v2 val | Accuracy | 50.1 | BLIP-2 ViT-L OPT 2.7B (zero-shot)
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 82.3 | BLIP-2 ViT-G OPT 6.7B (fine-tuned)
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 81.74 | BLIP-2 ViT-G OPT 2.7B (fine-tuned)
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 81.66 | BLIP-2 ViT-G FlanT5 XL (fine-tuned)
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 65 | BLIP-2 ViT-G FlanT5 XXL (zero-shot)
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 63 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 62.3 | BLIP-2 ViT-L FlanT5 XL (zero-shot)
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 52.6 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 52.3 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 49.7 | BLIP-2 ViT-L OPT 2.7B (zero-shot)
Visual Question Answering (VQA) | GQA test-dev | Accuracy | 44.7 | BLIP-2 ViT-G FlanT5 XXL (zero-shot)
Visual Question Answering (VQA) | GQA test-dev | Accuracy | 44.4 | BLIP-2 ViT-L FlanT5 XL (zero-shot)
Visual Question Answering (VQA) | GQA test-dev | Accuracy | 44.2 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Visual Question Answering (VQA) | GQA test-dev | Accuracy | 36.4 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Visual Question Answering (VQA) | GQA test-dev | Accuracy | 34.6 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Visual Question Answering (VQA) | GQA test-dev | Accuracy | 33.9 | BLIP-2 ViT-L OPT 2.7B (zero-shot)
Visual Question Answering (VQA) | PMC-VQA | Accuracy | 24.3 | BLIP-2
Generative Visual Question Answering | PMC-VQA | BLEU-1 | 7.6 | BLIP-2
Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 2.76 | BLIP-2-OPT2.7B
Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 18.96 | BLIP-2-OPT2.7B
Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 7.5 | BLIP-2-OPT2.7B
Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 19.31 | BLIP-2-OPT2.7B
Image Captioning | nocaps-val-in-domain | CIDEr | 123.7 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Image Captioning | nocaps-val-in-domain | SPICE | 16.3 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Image Captioning | nocaps-val-in-domain | CIDEr | 123.7 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Image Captioning | nocaps-val-in-domain | SPICE | 15.8 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Image Captioning | nocaps-val-in-domain | CIDEr | 123 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Image Captioning | nocaps-val-in-domain | SPICE | 15.8 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Image Captioning | nocaps-val-near-domain | CIDEr | 120.2 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Image Captioning | nocaps-val-near-domain | SPICE | 15.9 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Image Captioning | nocaps-val-near-domain | CIDEr | 119.2 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Image Captioning | nocaps-val-near-domain | SPICE | 15.3 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Image Captioning | nocaps-val-near-domain | CIDEr | 117.8 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Image Captioning | nocaps-val-near-domain | SPICE | 15.4 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Image Captioning | nocaps-val-out-domain | CIDEr | 124.8 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Image Captioning | nocaps-val-out-domain | SPICE | 15.1 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Image Captioning | nocaps-val-out-domain | CIDEr | 124.4 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Image Captioning | nocaps-val-out-domain | SPICE | 14.8 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Image Captioning | nocaps-val-out-domain | CIDEr | 123.4 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Image Captioning | nocaps-val-out-domain | SPICE | 15.1 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Image Captioning | nocaps-val-overall | CIDEr | 121.6 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Image Captioning | nocaps-val-overall | SPICE | 15.8 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Image Captioning | nocaps-val-overall | CIDEr | 121 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Image Captioning | nocaps-val-overall | SPICE | 15.3 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Image Captioning | nocaps-val-overall | CIDEr | 119.7 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Image Captioning | nocaps-val-overall | SPICE | 15.4 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Image Captioning | COCO Captions | BLEU-4 | 43.7 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Image Captioning | COCO Captions | CIDEr | 145.8 | BLIP-2 ViT-G OPT 2.7B (zero-shot)
Image Captioning | COCO Captions | BLEU-4 | 43.5 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Image Captioning | COCO Captions | CIDEr | 145.2 | BLIP-2 ViT-G OPT 6.7B (zero-shot)
Image Captioning | COCO Captions | BLEU-4 | 42.4 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Image Captioning | COCO Captions | CIDEr | 144.5 | BLIP-2 ViT-G FlanT5 XL (zero-shot)
Image Retrieval | Flickr30k | Recall@1 | 89.7 | BLIP-2 ViT-G (zero-shot, 1K test set)
Image Retrieval | Flickr30k | Recall@5 | 98.1 | BLIP-2 ViT-G (zero-shot, 1K test set)
Image Retrieval | Flickr30k | Recall@10 | 98.9 | BLIP-2 ViT-G (zero-shot, 1K test set)
Image Retrieval | Flickr30k | Recall@1 | 88.6 | BLIP-2 ViT-L (zero-shot, 1K test set)
Image Retrieval | Flickr30k | Recall@5 | 97.6 | BLIP-2 ViT-L (zero-shot, 1K test set)
Image Retrieval | Flickr30k | Recall@10 | 98.9 | BLIP-2 ViT-L (zero-shot, 1K test set)
Image Retrieval | COCO (Common Objects in Context) | Recall@1 | 68.3 | BLIP-2 ViT-G (fine-tuned)
Image Retrieval | COCO (Common Objects in Context) | Recall@5 | 87.7 | BLIP-2 ViT-G (fine-tuned)
Image Retrieval | COCO (Common Objects in Context) | Recall@10 | 92.6 | BLIP-2 ViT-G (fine-tuned)
Image Retrieval | COCO (Common Objects in Context) | Recall@1 | 66.3 | BLIP-2 ViT-L (fine-tuned)
Image Retrieval | COCO (Common Objects in Context) | Recall@5 | 86.5 | BLIP-2 ViT-L (fine-tuned)
Image Retrieval | COCO (Common Objects in Context) | Recall@10 | 91.8 | BLIP-2 ViT-L (fine-tuned)
Image-to-Text Retrieval | Flickr30k | Recall@1 | 97.6 | BLIP-2 ViT-G (zero-shot, 1K test set)
Image-to-Text Retrieval | Flickr30k | Recall@5 | 100 | BLIP-2 ViT-G (zero-shot, 1K test set)
Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | BLIP-2 ViT-G (zero-shot, 1K test set)
Image-to-Text Retrieval | Flickr30k | Recall@1 | 96.9 | BLIP-2 ViT-L (zero-shot, 1K test set)
Image-to-Text Retrieval | Flickr30k | Recall@5 | 100 | BLIP-2 ViT-L (zero-shot, 1K test set)
Image-to-Text Retrieval | Flickr30k | Recall@10 | 100 | BLIP-2 ViT-L (zero-shot, 1K test set)
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@1 | 85.4 | BLIP-2 (ViT-G, fine-tuned)
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@5 | 97 | BLIP-2 (ViT-G, fine-tuned)
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@10 | 98.5 | BLIP-2 (ViT-G, fine-tuned)
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@1 | 83.5 | BLIP-2 (ViT-L, fine-tuned)
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@5 | 96 | BLIP-2 (ViT-L, fine-tuned)
Image-to-Text Retrieval | COCO (Common Objects in Context) | Recall@10 | 98 | BLIP-2 (ViT-L, fine-tuned)
Open Vocabulary Attribute Detection | OVAD-Box benchmark | mean average precision | 25.5 | BLIP-2 (pretrained)
Instruction Following | LLaVA-Bench | avg score | 38.1 | BLIP-2
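Most of the VQA rows above are zero-shot numbers obtained by prompting the frozen LLM with the question as plain text. The snippet below is a hedged way to reproduce a single prediction using the Hugging Face transformers port of BLIP-2 rather than the paper's official code; the checkpoint name, prompt template, and generation settings are assumptions, and matching the table exactly also requires the benchmark's own answer post-processing and metric, which this snippet does not implement.

```python
# Hedged example using the transformers port of BLIP-2 (assumed checkpoint and prompt format).
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Assumed checkpoint: the public BLIP-2 ViT-g + OPT-2.7B release on the Hugging Face Hub.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Any RGB image works; this COCO val image is just a convenient public example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Zero-shot VQA is plain-text prompting of the frozen LLM around the question.
prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)

generated_ids = model.generate(**inputs, max_new_tokens=10)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)
```

The FlanT5 variants in the table are used the same way, with a different checkpoint (e.g. Salesforce/blip2-flan-t5-xl) swapped in.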

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)