Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Image Retrieval with Multi-Modal Query on Flickr30k

Metric: Image-to-text R@1 (higher is better)
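
For readers unfamiliar with the metric, the following is a minimal sketch of how image-to-text R@1 is commonly computed: each image is used as a query, all candidate captions are ranked by similarity, and the query counts as a hit if the top-ranked caption is one of that image's ground-truth captions. The function name, the similarity scores, and the toy data are all hypothetical, and whether every leaderboard entry was evaluated in exactly this way is an assumption.

```python
from typing import Sequence, Set


def image_to_text_recall_at_k(
    similarity: Sequence[Sequence[float]],
    relevant_captions: Sequence[Set[int]],
    k: int = 1,
) -> float:
    """Percentage of image queries with at least one relevant caption in the top k.

    similarity[i][j] is a hypothetical score between image i and caption j;
    relevant_captions[i] holds the caption indices that describe image i.
    """
    if len(similarity) != len(relevant_captions):
        raise ValueError("need one set of relevant captions per image")
    if not similarity:
        raise ValueError("need at least one image query")
    hits = 0
    for scores, relevant in zip(similarity, relevant_captions):
        # Rank caption indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if any(j in relevant for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (made up for illustration): 3 images, 6 captions, 2 captions per image.
toy_similarity = [
    [0.9, 0.2, 0.1, 0.3, 0.0, 0.1],  # top caption is 0, which describes image 0
    [0.1, 0.8, 0.4, 0.5, 0.2, 0.0],  # top caption is 1, which does not describe image 1
    [0.0, 0.1, 0.2, 0.3, 0.4, 0.7],  # top caption is 5, which describes image 2
]
toy_relevant = [{0, 1}, {2, 3}, {4, 5}]

r_at_1 = image_to_text_recall_at_k(toy_similarity, toy_relevant, k=1)
print(f"Image-to-text R@1: {r_at_1:.1f}")  # prints "Image-to-text R@1: 66.7"
```

In this toy example two of the three image queries retrieve a correct caption at rank 1, so R@1 is 66.7. Raising k to 2 also recovers the second image, since one of its captions is ranked second.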

Results

| # | Model | Image-to-text R@1 | Extra Data | Paper | Date | Code |
|---|-------|-------------------|------------|-------|------|------|
| 1 | X2-VLM (large) | 98.8 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 2 | X2-VLM (base) | 98.5 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 3 | BEiT-3 | 98 | Yes | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 4 | OmniVL (14M) | 97.3 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 5 | ERNIE-ViL 2.0 | 97.2 | Yes | ERNIE-ViL 2.0: Multi-view Contrastive Learning f... | 2022-09-30 | Code |
| 6 | Aurora (ours, r=128) | 97.2 | Yes | - | - | - |
| 7 | X-VLM (base) | 97.1 | Yes | Multi-Grained Vision Language Pre-Training: Alig... | 2021-11-16 | Code |
| 8 | VSE-Gradient | 97 | Yes | Dissecting Deep Metric Learning Losses for Image... | 2022-10-21 | Code |
| 9 | InternVL-G | 95.7 | No | InternVL: Scaling up Vision Foundation Models an... | 2023-12-21 | Code |
| 10 | ALIGN | 95.3 | Yes | Scaling Up Visual and Vision-Language Representa... | 2021-02-11 | Code |
| 11 | BEiT-3 | 94.9 | No | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 12 | InternVL-C | 94.7 | No | InternVL: Scaling up Vision Foundation Models an... | 2023-12-21 | Code |
| 13 | COSMOS ViT-B/16 | 92.9 | No | COSMOS: Cross-Modality Self-Distillation for Vis... | 2024-12-02 | Code |
| 14 | CoCa | 92.5 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 15 | RO-ViT | 92.1 | No | Region-Aware Pretraining for Open-Vocabulary Obj... | 2023-05-11 | Code |
| 16 | M2-Encoder | 91.2 | Yes | M2-Encoder: Advancing Bilingual Image-Text Under... | 2024-01-29 | Code |
| 17 | ERNIE-ViL 2.0 | 91.2 | No | ERNIE-ViL 2.0: Multi-view Contrastive Learning f... | 2022-09-30 | Code |
| 18 | Florence | 90.9 | No | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 19 | ALBEF | 90.5 | No | Align before Fuse: Vision and Language Represent... | 2021-07-16 | Code |
| 20 | COSMOS ViT-B/32 | 89.9 | No | COSMOS: Cross-Modality Self-Distillation for Vis... | 2024-12-02 | Code |
| 21 | ViSTA | 89.5 | Yes | ViSTA: Vision and Scene Text Aggregation for Cro... | 2022-03-31 | - |
| 22 | Flamingo | 89.3 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 23 | VK-OOD | 89 | No | - | - | Code |
| 24 | ALIGN | 88.6 | No | Scaling Up Visual and Vision-Language Representa... | 2021-02-11 | Code |
| 25 | IAIS | 88.3 | Yes | Learning Relation Alignment for Calibrated Cross... | 2021-05-28 | Code |
| 26 | CLIP | 88 | No | Learning Transferable Visual Models From Natural... | 2021-02-26 | Code |
| 27 | 3SHNet | 87.1 | No | 3SHNet: Boosting Image-Sentence Retrieval via Vi... | 2024-04-26 | Code |
| 28 | PTP-BLIP (14M) | 87.1 | No | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 29 | AltCLIP | 86 | No | AltCLIP: Altering the Language Encoder in CLIP f... | 2022-11-12 | Code |
| 30 | ViLT-B/32 | 83.5 | Yes | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 31 | DSMD | 82.5 | No | Dynamic Self-adaptive Multiscale Distillation fr... | 2024-04-16 | Code |
| 32 | RCAR | 82.3 | No | Plug-and-Play Regulators for Image-Text Matching | 2023-03-23 | Code |
| 33 | UNITER | 80.7 | No | UNITER: UNiversal Image-TExt Representation Lear... | 2019-09-25 | Code |
| 34 | NAPReg | 79.6 | No | - | - | Code |
| 35 | SGRAF | 77.8 | No | Similarity Reasoning and Filtration for Image-Te... | 2021-01-05 | Code |
| 36 | GSMN | 76.4 | No | A Deep Local and Global Scene-Graph Matching for... | 2021-06-04 | Code |
| 37 | Pearl | 75.3 | No | - | - | - |
| 38 | IMRAM | 74.1 | No | IMRAM: Iterative Matching with Recurrent Attenti... | 2020-03-08 | Code |
| 39 | ViLT-B/32 | 73.2 | No | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 40 | ImageBERT | 70.7 | No | ImageBERT: Cross-modal Pre-training with Large-s... | 2020-01-22 | - |
| 41 | SCAN | 67.4 | No | Stacked Cross Attention for Image-Text Matching | 2018-03-21 | Code |
| 42 | Dual-Path (ResNet) | 55.6 | No | Dual-Path Convolutional Image-Text Embeddings wi... | 2017-11-15 | Code |
| 43 | SCO (ResNet) | 55.5 | No | Learning Semantic Concepts and Order for Image a... | 2017-12-06 | - |
| 44 | VSE++ (ResNet) | 52.9 | No | VSE++: Improving Visual-Semantic Embeddings with... | 2017-07-18 | Code |
| 45 | CMPL (ResNet) | 49.6 | No | - | - | Code |