Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Cross-Modal Information Retrieval on COCO 2014

Metric: Image-to-text R@5 (higher is better)
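For readers unfamiliar with the metric, image-to-text R@K is conventionally the percentage of query images for which at least one ground-truth caption appears among the K highest-scoring retrieved captions. The following is a minimal sketch under that assumption; the function name `image_to_text_recall_at_k` and the toy similarity scores are illustrative and not taken from any listed paper, and this page does not state which test split or evaluation protocol each submission used.

```python
def image_to_text_recall_at_k(similarity, gt_captions, k=5):
    """Percentage of images with at least one correct caption in the top k.

    similarity  -- list of rows, one per image; row[j] is the score of caption j
    gt_captions -- list of sets; gt_captions[i] holds the caption indices
                   that belong to image i
    """
    hits = 0
    for scores, gt in zip(similarity, gt_captions):
        # Rank caption indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # Count a hit if any ground-truth caption is in the top k.
        if gt & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (invented for illustration): 3 images, 6 captions,
# captions 2i and 2i+1 belong to image i.
similarity = [
    [0.9, 0.1, 0.3, 0.2, 0.0, 0.4],
    [0.8, 0.2, 0.5, 0.6, 0.1, 0.0],
    [0.1, 0.7, 0.2, 0.3, 0.4, 0.6],
]
gt_captions = [{0, 1}, {2, 3}, {4, 5}]

r_at_1 = image_to_text_recall_at_k(similarity, gt_captions, k=1)
r_at_2 = image_to_text_recall_at_k(similarity, gt_captions, k=2)
```

In this toy case only the first image has a correct caption ranked first, while all three have one within the top two, so the score rises as K grows.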


Results

| # | Model | Image-to-text R@5 | Extra Data | Paper | Date | Code |
|---|-------|-------------------|------------|-------|------|------|
| 1 | X2-VLM (large) | 96.5 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 2 | BEiT-3 | 96.5 | Yes | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 3 | XFM (base) | 96.4 | Yes | Toward Building General Foundation Models for La... | 2023-01-12 | Code |
| 4 | X2-VLM (base) | 96.3 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 5 | PTP-BLIP (14M) | 95.9 | Yes | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 6 | OmniVL (14M) | 95.9 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 7 | VSE-Gradient | 95.6 | Yes | Dissecting Deep Metric Learning Losses for Image... | 2022-10-21 | Code |
| 8 | X-VLM (base) | 95.6 | Yes | Multi-Grained Vision Language Pre-Training: Alig... | 2021-11-16 | Code |
| 9 | Aurora (ours, r=128) | 95.3 | Yes | - | - | - |
| 10 | Florence | 95.2 | Yes | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 11 | VK-OOD | 95.1 | Yes | - | - | Code |
| 12 | ALBEF | 94.3 | Yes | Align before Fuse: Vision and Language Represent... | 2021-07-16 | Code |
| 13 | ERNIE-ViL 2.0 | 93.6 | Yes | ERNIE-ViL 2.0: Multi-view Contrastive Learning f... | 2022-09-30 | Code |
| 14 | ALIGN | 93.5 | Yes | Scaling Up Visual and Vision-Language Representa... | 2021-02-11 | Code |
| 15 | METER | 93.16 | Yes | An Empirical Study of Training End-to-End Vision... | 2021-11-03 | Code |
| 16 | TCL | 92.8 | Yes | Vision-Language Pre-Training with Triple Contras... | 2022-02-21 | Code |
| 17 | Oscar | 92.2 | Yes | Oscar: Object-Semantics Aligned Pre-training for... | 2020-04-13 | Code |
| 18 | 3SHNet | 90.5 | No | 3SHNet: Boosting Image-Sentence Retrieval via Vi... | 2024-04-26 | Code |
| 19 | ViSTA | 90.1 | Yes | ViSTA: Vision and Scene Text Aggregation for Cro... | 2022-03-31 | - |
| 20 | MaMMUT (ours) | 89.1 | No | MaMMUT: A Simple Architecture for Joint Learning... | 2023-03-29 | Code |
| 21 | ALADIN | 88.6 | Yes | ALADIN: Distilling Fine-grained Alignment Scores... | 2022-07-29 | Code |
| 22 | ViLT-B/32 | 86.3 | Yes | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 23 | RCAR | 86.1 | No | Plug-and-Play Regulators for Image-Text Matching | 2023-03-23 | Code |
| 24 | SGRAF | 84.9 | No | Similarity Reasoning and Filtration for Image-Te... | 2021-01-05 | Code |
| 25 | IMRAM | 83.2 | No | IMRAM: Iterative Matching with Recurrent Attenti... | 2020-03-08 | Code |
| 26 | LILE | 82.4 | No | LILE: Look In-Depth before Looking Elsewhere -- ... | 2022-03-02 | - |
| 27 | SCAN | 82.2 | No | Stacked Cross Attention for Image-Text Matching | 2018-03-21 | Code |
| 28 | VSRN | 81.1 | No | Visual Semantic Reasoning for Image-Text Matching | 2019-09-06 | Code |
| 29 | DSMD | 75.6 | No | Dynamic Self-adaptive Multiscale Distillation fr... | 2024-04-16 | Code |
| 30 | PVSE | 74.3 | No | Polysemous Visual-Semantic Embedding for Cross-M... | 2019-06-11 | Code |
| 31 | SCO (ResNet) | 72.3 | No | Learning Semantic Concepts and Order for Image a... | 2017-12-06 | - |
| 32 | Dual-Path (ResNet) | 70.5 | No | Deep Visual-Semantic Alignments for Generating I... | 2014-12-07 | Code |