Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

BLIP

Reported on 21 benchmarks across 12 tasks · 4 papers · 2 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
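The effect of exact-name matching can be sketched as follows. This is a minimal illustration, not the site's actual implementation: the function `group_by_exact_name` and the row layout are assumptions, and the last row uses a hypothetical variant name with no value. The other rows are taken from this page.

```python
from collections import defaultdict

# Rows: (model name as reported, benchmark, metric, value, arXiv id).
# The first three rows appear on this page. The last row is a
# HYPOTHETICAL variant name with no value, included only to show
# that a different spelling is not matched to "BLIP".
rows = [
    ("BLIP", "MSCOCO", "Recall@1", 57.32, "2301.04742"),
    ("BLIP", "Winoground", "Text Score", 39.0, "2305.06343"),
    ("HADA", "MSCOCO", "Recall@1", 58.46, "2301.04742"),
    ("BLIP (hypothetical variant)", "MSCOCO", "Recall@1", None, None),
]


def group_by_exact_name(rows):
    """Group results by the exact model-name string (assumed behaviour)."""
    grouped = defaultdict(list)
    for name, benchmark, metric, value, paper in rows:
        grouped[name].append((benchmark, metric, value, paper))
    return dict(grouped)


grouped = group_by_exact_name(rows)

# Results from two different papers land under the same "BLIP" key,
# even if those papers evaluated different checkpoints.
for result in grouped["BLIP"]:
    print(result)

# The differently spelled name is kept as a separate model entry.
print(sorted(grouped))
```

Under this assumption, rows from different papers are merged whenever the name strings are identical, which is why the figures listed under one name may not all describe the same checkpoint.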

Computer Vision · 9 results

Image Retrieval on MSCOCO
Recall@1 · 2023-01-11
57.32
best: 58.46 (HADA)
HADA: A Graph-based Amalgamation Framework in Image-text Retrieval arXiv:2301.04742
Image Retrieval on MSCOCO
Recall@10 · 2023-01-11
88.92
best: 89.66 (HADA)
HADA: A Graph-based Amalgamation Framework in Image-text Retrieval arXiv:2301.04742
Image Retrieval on MSCOCO
Recall@5 · 2023-01-11
81.84
best: 82.85 (HADA)
HADA: A Graph-based Amalgamation Framework in Image-text Retrieval arXiv:2301.04742
Object Detection on OVAD-Box benchmark
mean average precision · uses extra data · 2022-01-28
24.3
best: 28 (X-VLM)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation arXiv:2201.12086
Open Vocabulary Object Detection on OVAD-Box benchmark
mean average precision · uses extra data · 2022-01-28
24.3
best: 28 (X-VLM)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation arXiv:2201.12086
Image Retrieval on ConQA Conceptual
R-precision
5.4
best: 6.8 (CLIP)
Image Retrieval on ConQA Conceptual
Recall@1
4.1
best: 12.2 (CLIP)
Image Retrieval on ConQA Conceptual
Recall@10
40.8
Image Retrieval on ConQA Conceptual
Recall@5
28.6
best: 30.6 (CLIP)

Methodology · 4 results

3D on OVAD-Box benchmark
mean average precision · uses extra data · 2022-01-28
24.3
best: 28 (X-VLM)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation arXiv:2201.12086
2D Classification on OVAD-Box benchmark
mean average precision · uses extra data · 2022-01-28
24.3
best: 28 (X-VLM)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation arXiv:2201.12086
2D Object Detection on OVAD-Box benchmark
mean average precision · uses extra data · 2022-01-28
24.3
best: 28 (X-VLM)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation arXiv:2201.12086
16k on OVAD-Box benchmark
mean average precision · uses extra data · 2022-01-28
24.3
best: 28 (X-VLM)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation arXiv:2201.12086

Natural Language Processing · 3 results

Visual Question Answering (VQA) on OVAD benchmark
Contains w. Synonyms · 2024-02-11
45.7
SOTA
Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy arXiv:2402.07270
Visual Question Answering (VQA) on OVAD benchmark
ExactMatch w. Synonyms · 2024-02-11
36.99
SOTA
Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy arXiv:2402.07270
Cross-Modal Retrieval on CommercialAdsDataset
ADD(S) AUC · 2022-01-28
83.51
best: 91.73 (AlignCMSS)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation arXiv:2201.12086

Reasoning · 3 results

Visual Reasoning on Winoground
Group Score · 2023-05-10
15
best: 58.75 (GPT-4V (CoT, pick b/w two options))
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs arXiv:2305.06343
Visual Reasoning on Winoground
Image Score · 2023-05-10
19.2
best: 68.75 (GPT-4V (CoT, pick b/w two options))
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs arXiv:2305.06343
Visual Reasoning on Winoground
Text Score · 2023-05-10
39
best: 75.5 (GPT-4o + CA)
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs arXiv:2305.06343

Miscellaneous · 2 results

Image Retrieval with Multi-Modal Query on CommercialAdsDataset
ADD(S) AUC · 2022-01-28
83.51
best: 91.73 (AlignCMSS)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation arXiv:2201.12086
Cross-Modal Information Retrieval on CommercialAdsDataset
ADD(S) AUC · 2022-01-28
83.51
best: 91.73 (AlignCMSS)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation arXiv:2201.12086