


BLIP

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Computer Vision · Introduced 2022 · 93 papers
Source Paper

Description

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.
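The core of the bootstrapping recipe described above is CapFilt: a captioner proposes a synthetic caption for each web image, and a filter (an image-text matching head) discards captions, whether web-harvested or synthetic, that do not match the image. Below is a minimal sketch of one such filtering decision, assuming the public BLIP checkpoints on the Hugging Face Hub accessed through the transformers library; the 0.5 keep-threshold and the reading of ITM logit index 1 as the "match" class are illustrative assumptions, not the paper's exact bootstrapping configuration.

```python
# Sketch of one CapFilt round: caption an image, then keep only captions
# that the image-text matching (ITM) filter judges as matching.
import torch
from PIL import Image
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BlipForImageTextRetrieval,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Captioner: generates a synthetic caption for a web image.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

# Filter: the ITM head scores how well a caption fits the image.
itm_processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
itm_filter = BlipForImageTextRetrieval.from_pretrained(
    "Salesforce/blip-itm-base-coco"
).to(device)


def bootstrap_pair(image: Image.Image, web_caption: str, threshold: float = 0.5):
    """Return the captions for this image that survive the ITM filter.

    The threshold value is an illustrative assumption, not the paper's.
    """
    # 1) Generate a synthetic caption for the image.
    inputs = cap_processor(images=image, return_tensors="pt").to(device)
    out = captioner.generate(**inputs, max_new_tokens=30)
    synthetic = cap_processor.decode(out[0], skip_special_tokens=True)

    kept = []
    # 2) Score both the noisy web caption and the synthetic caption;
    #    keep whichever ones the ITM head accepts.
    for caption in (web_caption, synthetic):
        itm_inputs = itm_processor(
            images=image, text=caption, return_tensors="pt"
        ).to(device)
        with torch.no_grad():
            logits = itm_filter(**itm_inputs).itm_score  # shape (1, 2)
        # Assumption: index 1 of the ITM logits is the "match" class.
        match_prob = logits.softmax(dim=-1)[0, 1].item()
        if match_prob >= threshold:
            kept.append(caption)
    return kept
```

In the paper itself, the captioner and filter are first finetuned on human-annotated pairs (e.g. COCO) before being applied to the web corpus, and the surviving image-text pairs are then used to pre-train a fresh model; the sketch above only shows the per-pair keep/drop decision.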

Papers Using This Method

Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (2025-07-14)
VisText-Mosquito: A Multimodal Dataset and Benchmark for AI-Based Mosquito Breeding Site Detection and Reasoning (2025-06-17)
Fusing Cross-modal and Uni-modal Representations: A Kronecker Product Approach (2025-06-10)
A Narrative Review on Large AI Models in Lung Cancer Screening, Diagnosis, and Treatment Planning (2025-06-08)
When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification (2025-05-22)
MedBLIP: Fine-tuning BLIP for Medical Image Captioning (2025-05-20)
From Complexity to Clarity: Transforming Chest X-ray Reports with Chained Prompting (Student Abstract) (2025-04-11)
Learning Sparse Disentangled Representations for Multimodal Exclusion Retrieval (2025-04-04)
OMR-Diffusion: Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Intent Understanding (2025-03-22)
TDRI: Two-Phase Dialogue Refinement and Co-Adaptation for Interactive Image Generation (2025-03-22)
Are Large Language Models Good Data Preprocessors? (2025-02-24)
NanoVLMs: How small can we go and still make coherent Vision Language Models? (2025-02-11)
An Evaluation Framework for Product Images Background Inpainting based on Human Feedback and Product Consistency (2024-12-23)
Barking Up The Syntactic Tree: Enhancing VLM Training with Syntactic Losses (2024-12-11)
Attacks on multimodal models (2024-12-02)
Understanding the World's Museums through Vision-Language Reasoning (2024-12-02)
Nearest Neighbor Normalization Improves Multimodal Retrieval (2024-10-31)
Technical Report for Soccernet 2023 -- Dense Video Captioning (2024-10-31)
EfficientEQA: An Efficient Approach for Open Vocabulary Embodied Question Answering (2024-10-26)