Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Cross-Modal Retrieval on COCO 2014

Metric: Text-to-image R@5 (higher is better)
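
The page does not spell out how R@5 is computed. As a rough sketch, assuming the common definition of Recall@K for text-to-image retrieval (the percentage of caption queries whose ground-truth image appears among the K highest-scoring images), it could look like the following. The function name `recall_at_k`, its arguments, and the toy scores are illustrative assumptions, not taken from any listed paper or from this leaderboard's evaluation code.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]],
                gt_image: Sequence[int],
                k: int = 5) -> float:
    """Percentage of text queries whose ground-truth image ranks in the top k.

    similarity[i][j] is the score between caption i and image j (higher = closer).
    gt_image[i] is the index of the image that caption i describes.
    """
    if not similarity:
        raise ValueError("need at least one query")
    hits = 0
    for scores, target in zip(similarity, gt_image):
        # Indices of the k highest-scoring images for this caption.
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        if target in top_k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 captions, 4 images (scores are made up for illustration).
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # caption 0: its image (0) is ranked 1st
    [0.8, 0.2, 0.7, 0.1],  # caption 1: its image (1) is ranked 3rd
    [0.1, 0.2, 0.3, 0.9],  # caption 2: its image (2) is ranked 2nd
]
toy_targets = [0, 1, 2]
```

On the toy data, `recall_at_k(toy_scores, toy_targets, k=3)` counts all three captions as hits, while `k=1` counts only the first. Individual papers may differ in details such as the test split used, so scores in the table are not guaranteed to follow this exact protocol.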

Results

| # | Model | Text-to-image R@5 | Extra Data | Paper | Date | Code |
|---|-------|-------------------|------------|-------|------|------|
| 1 | BEiT-3 | 92.8 | Yes | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 2 | VAST | 87.7 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 3 | X2-VLM (large) | 87.5 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 4 | PTP-BLIP (14M) | 87.4 | Yes | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 5 | XFM (base) | 87.2 | Yes | Toward Building General Foundation Models for La... | 2023-01-12 | Code |
| 6 | X2-VLM (base) | 87.1 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 7 | OmniVL (14M) | 86.1 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 8 | VSE-Gradient | 86 | Yes | Dissecting Deep Metric Learning Losses for Image... | 2022-10-21 | Code |
| 9 | DSMD | 85.9 | No | Dynamic Self-adaptive Multiscale Distillation fr... | 2024-04-16 | Code |
| 10 | X-VLM (base) | 85.8 | Yes | Multi-Grained Vision Language Pre-Training: Alig... | 2021-11-16 | Code |
| 11 | Florence | 85.7 | Yes | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 12 | VK-OOD | 84.8 | Yes | - | - | Code |
| 13 | Aurora (ours, r=128) | 84.8 | Yes | - | - | - |
| 14 | VALOR | 84.4 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 15 | ALBEF | 84.3 | Yes | Align before Fuse: Vision and Language Represent... | 2021-07-16 | Code |
| 16 | ERNIE-ViL 2.0 | 83.4 | Yes | ERNIE-ViL 2.0: Multi-view Contrastive Learning f... | 2022-09-30 | Code |
| 17 | ALIGN | 83.3 | Yes | Scaling Up Visual and Vision-Language Representa... | 2021-02-11 | Code |
| 18 | TCL | 83.2 | Yes | Learning to Generate Text-grounded Mask for Open... | 2022-12-01 | Code |
| 19 | Oscar | 82.8 | Yes | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Code |
| 20 | METER | 82.66 | Yes | An Empirical Study of Training End-to-End Vision... | 2021-11-03 | Code |
| 21 | ViSTA | 79.6 | Yes | ViSTA: Vision and Scene Text Aggregation for Cro... | 2022-03-31 | - |
| 22 | 3SHNet | 79.3 | No | 3SHNet: Boosting Image-Sentence Retrieval via Vi... | 2024-04-26 | Code |
| 23 | ALADIN | 79.2 | Yes | ALADIN: Distilling Fine-grained Alignment Scores... | 2022-07-29 | Code |
| 24 | RCAR | 73.2 | No | Plug-and-Play Regulators for Image-Text Matching | 2023-03-23 | Code |
| 25 | ViLT-B/32 | 72.9 | Yes | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 26 | VisualSparta | 72.8 | Yes | VisualSparta: An Embarrassingly Simple Approach ... | 2021-01-01 | Code |
| 27 | LILE | 72.1 | No | LILE: Look In-Depth before Looking Elsewhere -- ... | 2022-03-02 | - |
| 28 | SGRAF | 70.7 | No | Similarity Reasoning and Filtration for Image-Te... | 2021-01-05 | Code |
| 29 | VSRN | 70.6 | No | Visual Semantic Reasoning for Image-Text Matching | 2019-09-06 | Code |
| 30 | SCAN | 69.3 | No | Stacked Cross Attention for Image-Text Matching | 2018-03-21 | Code |
| 31 | IMRAM | 69.1 | No | IMRAM: Iterative Matching with Recurrent Attenti... | 2020-03-08 | Code |
| 32 | PVSE | 63 | No | Polysemous Visual-Semantic Embedding for Cross-M... | 2019-06-11 | Code |
| 33 | SCO (ResNet) | 62.9 | No | Learning Semantic Concepts and Order for Image a... | 2017-12-06 | - |
| 34 | Dual-Path (ResNet) | 53.4 | No | Deep Visual-Semantic Alignments for Generating I... | 2014-12-07 | Code |