Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Image Retrieval with Multi-Modal Query on COCO 2014

Metric: Text-to-image R@1 (higher is better)
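
For readers unfamiliar with the metric, the following is a minimal sketch of how text-to-image Recall@K is commonly computed from a text-by-image similarity matrix. It is an illustration only and not the evaluation code behind any entry on this leaderboard. The function name `text_to_image_recall_at_k`, its arguments, and the toy scores are hypothetical, and the sketch assumes each text query has exactly one ground-truth image.

```python
import numpy as np


def text_to_image_recall_at_k(similarity, gt_image_index, k=1):
    """Return text-to-image Recall@K as a percentage.

    similarity: array of shape (num_texts, num_images), higher = more similar.
    gt_image_index: array of shape (num_texts,), index of the correct image
        for each text query (assumption: one correct image per query).
    """
    similarity = np.asarray(similarity, dtype=float)
    gt = np.asarray(gt_image_index)
    # Score assigned to the ground-truth image for each text query.
    gt_scores = similarity[np.arange(similarity.shape[0]), gt]
    # Zero-based rank: number of images scoring strictly higher than the
    # ground truth (ties are resolved in favour of the ground truth).
    ranks = (similarity > gt_scores[:, None]).sum(axis=1)
    # A query counts as a hit when its ground truth is within the top k.
    return 100.0 * float((ranks < k).mean())


# Toy example with 4 text queries and 3 images (made-up scores).
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.4],
    [0.2, 0.6, 0.5],
]
toy_gt = [0, 2, 1, 2]

print(text_to_image_recall_at_k(toy_similarity, toy_gt, k=1))  # 75.0
print(text_to_image_recall_at_k(toy_similarity, toy_gt, k=2))  # 100.0
```

In the toy example the last query ranks its correct image second, so it misses at K=1 and hits at K=2. The scores in the table below are reported on the same 0 to 100 scale.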


Results

| # | Model | Text-to-image R@1 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | VAST | 68 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 2 | X2-VLM (large) | 67.7 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 3 | BEiT-3 | 67.2 | Yes | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 4 | XFM (base) | 67 | Yes | Toward Building General Foundation Models for La... | 2023-01-12 | Code |
| 5 | X2-VLM (base) | 66.2 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 6 | PTP-BLIP (14M) | 64.9 | Yes | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 7 | OmniVL (14M) | 64.8 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 8 | VSE-Gradient | 63.6 | Yes | Dissecting Deep Metric Learning Losses for Image... | 2022-10-21 | Code |
| 9 | X-VLM (base) | 63.4 | Yes | Multi-Grained Vision Language Pre-Training: Alig... | 2021-11-16 | Code |
| 10 | Florence | 63.2 | Yes | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 11 | VK-OOD | 62.9 | Yes | - | - | Code |
| 12 | Aurora (ours, r=128) | 62.8 | Yes | - | - | - |
| 13 | DSMD | 62.1 | No | Dynamic Self-adaptive Multiscale Distillation fr... | 2024-04-16 | Code |
| 14 | VALOR | 61.4 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 15 | ALBEF | 60.7 | Yes | Align before Fuse: Vision and Language Represent... | 2021-07-16 | Code |
| 16 | ALIGN | 59.9 | Yes | Scaling Up Visual and Vision-Language Representa... | 2021-02-11 | Code |
| 17 | ERNIE-ViL 2.0 | 59.5 | Yes | ERNIE-ViL 2.0: Multi-view Contrastive Learning f... | 2022-09-30 | Code |
| 18 | TCL | 59 | Yes | Vision-Language Pre-Training with Triple Contras... | 2022-02-21 | Code |
| 19 | InternVL-G | 58.6 | No | InternVL: Scaling up Vision Foundation Models an... | 2023-12-21 | Code |
| 20 | Oscar | 57.5 | Yes | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Code |
| 21 | METER | 57.08 | Yes | An Empirical Study of Training End-to-End Vision... | 2021-11-03 | Code |
| 22 | M2-Encoder | 56.5 | No | M2-Encoder: Advancing Bilingual Image-Text Under... | 2024-01-29 | Code |
| 23 | InternVL-C | 54.1 | No | InternVL: Scaling up Vision Foundation Models an... | 2023-12-21 | Code |
| 24 | TCL | 53.5 | No | Vision-Language Pre-Training with Triple Contras... | 2022-02-21 | Code |
| 25 | ViSTA | 52.6 | Yes | ViSTA: Vision and Scene Text Aggregation for Cro... | 2022-03-31 | - |
| 26 | COSMOS ViT-B/16 | 52.5 | No | COSMOS: Cross-Modality Self-Distillation for Vis... | 2024-12-02 | Code |
| 27 | RO-ViT | 51.8 | No | Region-Aware Pretraining for Open-Vocabulary Obj... | 2023-05-11 | Code |
| 28 | ALADIN | 51.3 | Yes | ALADIN: Distilling Fine-grained Alignment Scores... | 2022-07-29 | Code |
| 29 | CoCa | 51.2 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 30 | 3SHNet | 50.3 | No | 3SHNet: Boosting Image-Sentence Retrieval via Vi... | 2024-04-26 | Code |
| 31 | ALBEF | 50.1 | No | Align before Fuse: Vision and Language Represent... | 2021-07-16 | Code |
| 32 | PTP-BLIP | 49.5 | No | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 33 | COSMOS ViT-B/32 | 48.4 | No | COSMOS: Cross-Modality Self-Distillation for Vis... | 2024-12-02 | Code |
| 34 | Flamingo | 48 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 35 | Florence | 47.2 | No | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 36 | ERNIE-ViL 2.0 | 46 | No | ERNIE-ViL 2.0: Multi-view Contrastive Learning f... | 2022-09-30 | Code |
| 37 | ALIGN | 45.6 | No | Scaling Up Visual and Vision-Language Representa... | 2021-02-11 | Code |
| 38 | VisualSparta | 44.4 | Yes | VisualSparta: An Embarrassingly Simple Approach ... | 2021-01-01 | Code |
| 39 | RCAR | 44.3 | No | Plug-and-Play Regulators for Image-Text Matching | 2023-03-23 | Code |
| 40 | NAPReg | 43 | No | - | - | Code |
| 41 | ViLT-B/32 | 42.7 | Yes | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 42 | SGRAF | 41.9 | No | Similarity Reasoning and Filtration for Image-Te... | 2021-01-05 | Code |
| 43 | LILE | 41.5 | No | LILE: Look In-Depth before Looking Elsewhere -- ... | 2022-03-02 | - |
| 44 | VSRN | 40.5 | No | Visual Semantic Reasoning for Image-Text Matching | 2019-09-06 | Code |
| 45 | ViLT-B/32 | 40.4 | No | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 46 | IMRAM | 39.7 | No | IMRAM: Iterative Matching with Recurrent Attenti... | 2020-03-08 | Code |
| 47 | SCAN | 38.6 | No | Stacked Cross Attention for Image-Text Matching | 2018-03-21 | Code |
| 48 | CLIP | 37.8 | No | Learning Transferable Visual Models From Natural... | 2021-02-26 | Code |
| 49 | SCO (ResNet) | 33.1 | No | Learning Semantic Concepts and Order for Image a... | 2017-12-06 | - |
| 50 | PVSE | 32.4 | No | Polysemous Visual-Semantic Embedding for Cross-M... | 2019-06-11 | Code |
| 51 | ImageBERT | 32.3 | No | ImageBERT: Cross-modal Pre-training with Large-s... | 2020-01-22 | - |
| 52 | Dual-Path (ResNet) | 25.3 | No | Deep Visual-Semantic Alignments for Generating I... | 2014-12-07 | Code |
| 53 | dfdf | 0 | No | - | - | - |