Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Image Retrieval with Multi-Modal Query on COCO 2014

Metric: Image-to-text R@1 (higher is better)
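For readers unfamiliar with the metric, the following is a minimal sketch of how image-to-text Recall@K is commonly computed: each image is a query, all captions are ranked by similarity, and the query counts as a hit if any of its ground-truth captions appears in the top K. The function name, the similarity-matrix layout, and the toy data are assumptions made for illustration; they are not taken from this leaderboard or from any listed paper's evaluation code.

```python
from typing import Sequence, Set


def image_to_text_recall_at_k(
    similarity: Sequence[Sequence[float]],
    relevant_captions: Sequence[Set[int]],
    k: int = 1,
) -> float:
    """Percentage of image queries with a ground-truth caption in the top k.

    similarity[i][j] is the (hypothetical) score between image i and caption j.
    relevant_captions[i] holds the caption indices that describe image i.
    """
    hits = 0
    for scores, relevant in zip(similarity, relevant_captions):
        # Rank caption indices by descending similarity to this image.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # A hit means at least one ground-truth caption is in the top k.
        if relevant.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (illustrative only): 2 images, 4 captions, 2 captions per image.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # image 0: best match is caption 0 (relevant)
    [0.2, 0.8, 0.4, 0.5],  # image 1: best match is caption 1 (not relevant)
]
relevant_captions = [{0, 1}, {2, 3}]

r_at_1 = image_to_text_recall_at_k(similarity, relevant_captions, k=1)
r_at_2 = image_to_text_recall_at_k(similarity, relevant_captions, k=2)
```

In this toy case only the first image retrieves a correct caption at rank 1, so R@1 is 50.0, while both images have a correct caption within the top 2. Scores on the leaderboard are reported on the same 0 to 100 scale.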


Results

| # | Model | Image-to-text R@1 | Extra Data | Paper | Date | Code |
|---|-------|-------------------|------------|-------|------|------|
| 1 | BEiT-3 | 84.8 | Yes | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 2 | X2-VLM (large) | 84.4 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 3 | XFM (base) | 84.2 | Yes | Toward Building General Foundation Models for La... | 2023-01-12 | Code |
| 4 | X2-VLM (base) | 83.5 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 5 | OmniVL (14M) | 82.1 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 6 | Florence | 81.8 | Yes | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 7 | PTP-BLIP (14M) | 81.5 | Yes | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 8 | VSE-Gradient | 81.4 | Yes | Dissecting Deep Metric Learning Losses for Image... | 2022-10-21 | Code |
| 9 | X-VLM (base) | 81.2 | Yes | Multi-Grained Vision Language Pre-Training: Alig... | 2021-11-16 | Code |
| 10 | VK-OOD | 80.7 | Yes | - | - | Code |
| 11 | Aurora (ours, r=128) | 80.7 | Yes | - | - | - |
| 12 | ALBEF | 77.6 | Yes | Align before Fuse: Vision and Language Represent... | 2021-07-16 | Code |
| 13 | ERNIE-ViL 2.0 | 77.4 | Yes | ERNIE-ViL 2.0: Multi-view Contrastive Learning f... | 2022-09-30 | Code |
| 14 | ALIGN | 77 | Yes | Scaling Up Visual and Vision-Language Representa... | 2021-02-11 | Code |
| 15 | METER | 76.16 | Yes | An Empirical Study of Training End-to-End Vision... | 2021-11-03 | Code |
| 16 | TCL | 75.6 | Yes | Vision-Language Pre-Training with Triple Contras... | 2022-02-21 | Code |
| 17 | InternVL-G | 74.9 | No | InternVL: Scaling up Vision Foundation Models an... | 2023-12-21 | Code |
| 18 | Oscar | 73.5 | Yes | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Code |
| 19 | M2-Encoder | 72.8 | No | M2-Encoder: Advancing Bilingual Image-Text Under... | 2024-01-29 | Code |
| 20 | TCL | 71.4 | No | Vision-Language Pre-Training with Triple Contras... | 2022-02-21 | Code |
| 21 | MaMMUT (ours) | 70.7 | No | MaMMUT: A Simple Architecture for Joint Learning... | 2023-03-29 | Code |
| 22 | InternVL-C | 70.6 | No | InternVL: Scaling up Vision Foundation Models an... | 2023-12-21 | Code |
| 23 | PTP-BLIP | 69.7 | No | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 24 | ViSTA | 68.9 | Yes | ViSTA: Vision and Scene Text Aggregation for Cro... | 2022-03-31 | - |
| 25 | RO-ViT | 68.9 | No | Region-Aware Pretraining for Open-Vocabulary Obj... | 2023-05-11 | Code |
| 26 | ALBEF | 68.7 | No | Align before Fuse: Vision and Language Represent... | 2021-07-16 | Code |
| 27 | COSMOS ViT-B/16 | 68 | No | COSMOS: Cross-Modality Self-Distillation for Vis... | 2024-12-02 | Code |
| 28 | 3SHNet | 67.9 | No | 3SHNet: Boosting Image-Sentence Retrieval via Vi... | 2024-04-26 | Code |
| 29 | CoCa | 66.3 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 30 | Flamingo | 65.9 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 31 | ALADIN | 64.9 | Yes | ALADIN: Distilling Fine-grained Alignment Scores... | 2022-07-29 | Code |
| 32 | Florence | 64.7 | No | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 33 | COSMOS ViT-B/32 | 64.3 | No | COSMOS: Cross-Modality Self-Distillation for Vis... | 2024-12-02 | Code |
| 34 | ERNIE-ViL 2.0 | 63.1 | No | ERNIE-ViL 2.0: Multi-view Contrastive Learning f... | 2022-09-30 | Code |
| 35 | ViLT-B/32 | 61.5 | Yes | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 36 | RCAR | 61.3 | No | Plug-and-Play Regulators for Image-Text Matching | 2023-03-23 | Code |
| 37 | NAPReg | 59.8 | No | - | - | Code |
| 38 | ALIGN | 58.6 | No | Scaling Up Visual and Vision-Language Representa... | 2021-02-11 | Code |
| 39 | CLIP | 58.4 | No | Learning Transferable Visual Models From Natural... | 2021-02-26 | Code |
| 40 | SGRAF | 57.8 | No | Similarity Reasoning and Filtration for Image-Te... | 2021-01-05 | Code |
| 41 | ViLT-B/32 | 56.5 | No | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 42 | LILE | 55.6 | No | LILE: Look In-Depth before Looking Elsewhere -- ... | 2022-03-02 | - |
| 43 | IMRAM | 53.7 | No | IMRAM: Iterative Matching with Recurrent Attenti... | 2020-03-08 | Code |
| 44 | VSRN | 53 | No | Visual Semantic Reasoning for Image-Text Matching | 2019-09-06 | Code |
| 45 | SCAN | 50.4 | No | Stacked Cross Attention for Image-Text Matching | 2018-03-21 | Code |
| 46 | DSMD | 48 | No | Dynamic Self-adaptive Multiscale Distillation fr... | 2024-04-16 | Code |
| 47 | PVSE | 45.2 | No | Polysemous Visual-Semantic Embedding for Cross-M... | 2019-06-11 | Code |
| 48 | ImageBERT | 44 | No | ImageBERT: Cross-modal Pre-training with Large-s... | 2020-01-22 | - |
| 49 | SCO (ResNet) | 42.8 | No | Learning Semantic Concepts and Order for Image a... | 2017-12-06 | - |
| 50 | Dual-Path (ResNet) | 41.2 | No | Deep Visual-Semantic Alignments for Generating I... | 2014-12-07 | Code |
| 51 | dfdf | 0 | No | - | - | - |