Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

Published: 2022-01-28

Tasks: Image-text Retrieval, Open Vocabulary Attribute Detection, Image-text Matching, Text Retrieval, Image Captioning, Visual Reasoning, Retrieval, Visual Question Answering (VQA)

Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.
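The bootstrapping procedure described in the abstract (CapFilt: a captioner generates synthetic captions, a filter discards mismatched ones) can be sketched as follows. This is a minimal illustration of the data flow only, not the paper's implementation; `captioner` and `filter_fn` are hypothetical stand-ins for the trained captioning and filtering models.

```python
def capfilt(web_pairs, captioner, filter_fn):
    """Bootstrap a cleaner training set from noisy web image-text pairs.

    web_pairs: iterable of (image, web_caption) pairs scraped from the web
    captioner: callable image -> synthetic caption (stand-in for the captioner model)
    filter_fn: callable (image, caption) -> bool, True if judged image-text matched
    """
    bootstrapped = []
    for image, web_caption in web_pairs:
        # Keep the original web caption only if the filter judges it matched.
        if filter_fn(image, web_caption):
            bootstrapped.append((image, web_caption))
        # Generate a synthetic caption and keep it if it also passes the filter.
        synthetic = captioner(image)
        if filter_fn(image, synthetic):
            bootstrapped.append((image, synthetic))
    return bootstrapped
```

The resulting `bootstrapped` set (plus clean human-annotated pairs, in the paper's setup) would then be used to pre-train a new model.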

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Visual Reasoning | NLVR2 Test | Accuracy | 83.09 | BLIP-129M |
| Image Captioning | nocaps-val-out-domain | CIDEr | 115.3 | BLIP_ViT-L |
| Image Captioning | nocaps-val-out-domain | SPICE | 14.4 | BLIP_ViT-L |
| Image Captioning | nocaps-val-out-domain | CIDEr | 111.5 | BLIP_CapFilt-L |
| Image Captioning | nocaps-val-out-domain | SPICE | 14.2 | BLIP_CapFilt-L |
| Image Captioning | nocaps-val-near-domain | CIDEr | 112.1 | BLIP_ViT-L |
| Image Captioning | nocaps-val-near-domain | SPICE | 14.9 | BLIP_ViT-L |
| Image Captioning | nocaps-val-near-domain | CIDEr | 108.6 | BLIP_CapFilt-L |
| Image Captioning | nocaps-val-near-domain | SPICE | 14.8 | BLIP_CapFilt-L |
| Image Captioning | nocaps-val-overall | CIDEr | 113.2 | BLIP_ViT-L |
| Image Captioning | nocaps-val-overall | SPICE | 14.8 | BLIP_ViT-L |
| Image Captioning | nocaps-val-overall | CIDEr | 109.6 | BLIP_CapFilt-L |
| Image Captioning | nocaps-val-overall | SPICE | 14.7 | BLIP_CapFilt-L |
| Image Captioning | nocaps-val-in-domain | CIDEr | 114.9 | BLIP_ViT-L |
| Image Captioning | nocaps-val-in-domain | SPICE | 15.2 | BLIP_ViT-L |
| Image Captioning | nocaps-val-in-domain | CIDEr | 111.8 | BLIP_CapFilt-L |
| Image Captioning | nocaps-val-in-domain | SPICE | 14.9 | BLIP_CapFilt-L |
| Image Retrieval with Multi-Modal Query | CommercialAdsDataset | ADD(S) AUC | 83.51 | BLIP |
| Object Detection | OVAD-Box benchmark | mean average precision | 24.3 | BLIP |
| 3D | OVAD-Box benchmark | mean average precision | 24.3 | BLIP |
| 2D Classification | OVAD-Box benchmark | mean average precision | 24.3 | BLIP |
| 2D Object Detection | OVAD-Box benchmark | mean average precision | 24.3 | BLIP |
| Cross-Modal Information Retrieval | CommercialAdsDataset | ADD(S) AUC | 83.51 | BLIP |
| Open Vocabulary Object Detection | OVAD-Box benchmark | mean average precision | 24.3 | BLIP |
| Cross-Modal Retrieval | CommercialAdsDataset | ADD(S) AUC | 83.51 | BLIP |
| 16k | OVAD-Box benchmark | mean average precision | 24.3 | BLIP |
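For the retrieval results above, recall@1 is the standard metric: the fraction of queries whose top-ranked candidate is the ground-truth match (the abstract reports a +2.7% gain in average recall@1). A minimal sketch, assuming a score matrix where `scores[i][j]` is the similarity of query `i` to candidate `j` and the correct candidate for query `i` sits at index `i`:

```python
def recall_at_1(scores):
    """Fraction of queries whose highest-scoring candidate is the ground truth.

    scores: list of rows; scores[i][j] = similarity(query i, candidate j),
    with the matching candidate for query i assumed to be at index i.
    """
    hits = 0
    for i, row in enumerate(scores):
        top = max(range(len(row)), key=lambda j: row[j])  # argmax over candidates
        hits += (top == i)
    return hits / len(scores)
```

"Average recall@1" in image-text retrieval typically averages this quantity over the image-to-text and text-to-image directions.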

Related Papers

- LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)