Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

BLIP

Reported on 21 benchmarks across 12 tasks · 4 papers · 2 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
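
As a rough illustration of what exact-name matching implies, the sketch below groups result rows by the literal model-name string. The row layout, the `group_by_exact_name` helper, and the "BLIP-large" row (with its placeholder paper ID and empty value) are assumptions made for this example; they are not the site's actual schema or data. The two "BLIP" rows reuse values listed on this page.

```python
from collections import defaultdict

# Hypothetical row layout: (model name, paper, benchmark, metric, value).
results = [
    ("BLIP", "arXiv:2201.12086", "OVAD-Box benchmark", "mean average precision", 24.3),
    ("BLIP", "arXiv:2301.04742", "MSCOCO", "Recall@1", 57.32),
    # Hypothetical variant row: placeholder paper ID, no real value attached.
    ("BLIP-large", "placeholder-paper-id", "MSCOCO", "Recall@1", None),
]


def group_by_exact_name(rows):
    """Group rows by the exact model-name string, with no normalization."""
    grouped = defaultdict(list)
    for name, paper, benchmark, metric, value in rows:
        grouped[name].append((paper, benchmark, metric, value))
    return dict(grouped)


grouped = group_by_exact_name(results)

# Rows from different papers land under the same key when the name is identical,
# even if the papers evaluated different variants of the model.
papers_for_blip = sorted({paper for paper, *_ in grouped["BLIP"]})
print(papers_for_blip)

# A differently spelled name forms its own group and is not merged into "BLIP".
print(sorted(grouped))
```

Under this kind of grouping, a page for one name can mix numbers from several checkpoints, so each result is best read alongside the paper it cites.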

Computer Vision · 9 results

  • Image Retrieval on MSCOCO
    Recall@1 · 2023-01-11
    57.32
    best: 58.46 (HADA)
    HADA: A Graph-based Amalgamation Framework in Image-text Retrieval · arXiv:2301.04742
  • Image Retrieval on MSCOCO
    Recall@10 · 2023-01-11
    88.92
    best: 89.66 (HADA)
    HADA: A Graph-based Amalgamation Framework in Image-text Retrieval · arXiv:2301.04742
  • Image Retrieval on MSCOCO
    Recall@5 · 2023-01-11
    81.84
    best: 82.85 (HADA)
    HADA: A Graph-based Amalgamation Framework in Image-text Retrieval · arXiv:2301.04742
  • Object Detection on OVAD-Box benchmark
    mean average precision · uses extra data · 2022-01-28
    24.3
    best: 28 (X-VLM)
    BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · arXiv:2201.12086
  • Open Vocabulary Object Detection on OVAD-Box benchmark
    mean average precision · uses extra data · 2022-01-28
    24.3
    best: 28 (X-VLM)
    BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · arXiv:2201.12086
  • Image Retrieval on ConQA Conceptual
    R-precision
    5.4
    best: 6.8 (CLIP)
  • Image Retrieval on ConQA Conceptual
    Recall@1
    4.1
    best: 12.2 (CLIP)
  • Image Retrieval on ConQA Conceptual
    Recall@10
    40.8
  • Image Retrieval on ConQA Conceptual
    Recall@5
    28.6
    best: 30.6 (CLIP)

Methodology · 4 results

  • 3D on OVAD-Box benchmark
    mean average precision · uses extra data · 2022-01-28
    24.3
    best: 28 (X-VLM)
    BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · arXiv:2201.12086
  • 2D Classification on OVAD-Box benchmark
    mean average precision · uses extra data · 2022-01-28
    24.3
    best: 28 (X-VLM)
    BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · arXiv:2201.12086
  • 2D Object Detection on OVAD-Box benchmark
    mean average precision · uses extra data · 2022-01-28
    24.3
    best: 28 (X-VLM)
    BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · arXiv:2201.12086
  • 16k on OVAD-Box benchmark
    mean average precision · uses extra data · 2022-01-28
    24.3
    best: 28 (X-VLM)
    BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · arXiv:2201.12086

Natural Language Processing · 3 results

  • Visual Question Answering (VQA) on OVAD benchmark
    Contains w. Synonyms · 2024-02-11
    45.7
    SOTA
    Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy · arXiv:2402.07270
  • Visual Question Answering (VQA) on OVAD benchmark
    ExactMatch w. Synonyms · 2024-02-11
    36.99
    SOTA
    Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy · arXiv:2402.07270
  • Cross-Modal Retrieval on CommercialAdsDataset
    ADD(S) AUC · 2022-01-28
    83.51
    best: 91.73 (AlignCMSS)
    BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · arXiv:2201.12086

Reasoning · 3 results

  • Visual Reasoning on Winoground
    Group Score · 2023-05-10
    15
    best: 58.75 (GPT-4V (CoT, pick b/w two options))
    Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs · arXiv:2305.06343
  • Visual Reasoning on Winoground
    Image Score · 2023-05-10
    19.2
    best: 68.75 (GPT-4V (CoT, pick b/w two options))
    Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs · arXiv:2305.06343
  • Visual Reasoning on Winoground
    Text Score · 2023-05-10
    39
    best: 75.5 (GPT-4o + CA)
    Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs · arXiv:2305.06343

Miscellaneous · 2 results

  • Image Retrieval with Multi-Modal Query on CommercialAdsDataset
    ADD(S) AUC · 2022-01-28
    83.51
    best: 91.73 (AlignCMSS)
    BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · arXiv:2201.12086
  • Cross-Modal Information Retrieval on CommercialAdsDataset
    ADD(S) AUC · 2022-01-28
    83.51
    best: 91.73 (AlignCMSS)
    BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · arXiv:2201.12086