Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Image Retrieval with Multi-Modal Query on Flickr30k

Metric: Text-to-image R@1 (higher is better)
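
As a rough illustration of the metric, text-to-image R@1 is commonly computed as the percentage of text queries whose ground-truth image is ranked first among all candidate images. The sketch below assumes a precomputed query-by-image similarity matrix; the function name `recall_at_k`, the toy scores, and the ground-truth indices are hypothetical and are not taken from any paper on this leaderboard, and individual papers may differ in evaluation details.

```python
def recall_at_k(similarity, gt_image_index, k=1):
    """Percentage of text queries whose ground-truth image ranks in the top k.

    similarity[q][i]  : score between text query q and candidate image i
    gt_image_index[q] : index of the correct image for query q
    (Both inputs are illustrative assumptions, not a specific benchmark API.)
    """
    hits = 0
    for scores, gt in zip(similarity, gt_image_index):
        # Rank candidate image indices by descending similarity score.
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hits += gt in top_k
    return 100.0 * hits / len(similarity)


# Toy example: 4 text queries scored against 3 candidate images.
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
    [0.1, 0.2, 0.7],
]
gt_image_index = [0, 1, 1, 2]

r1 = recall_at_k(similarity, gt_image_index, k=1)
print(f"Text-to-image R@1: {r1:.1f}")
```

In this toy case three of the four queries rank their correct image first, so the script reports an R@1 of 75.0.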

Results

| # | Model | Text-to-image R@1 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | ERNIE-ViL 2.0 | 93.3 | Yes | ERNIE-ViL 2.0: Multi-view Contrastive Learning f... | 2022-09-30 | Code |
| 2 | M2-Encoder | 92.2 | Yes | M2-Encoder: Advancing Bilingual Image-Text Under... | 2024-01-29 | Code |
| 3 | X2-VLM (large) | 91.8 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 4 | VAST | 91 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 5 | X2-VLM (base) | 90.4 | Yes | X$^2$-VLM: All-In-One Pre-trained Model For Visi... | 2022-11-22 | Code |
| 6 | VAST | 90.4 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 7 | BEiT-3 | 90.3 | Yes | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 8 | OmniVL (14M) | 87.9 | Yes | OmniVL:One Foundation Model for Image-Language a... | 2022-09-15 | - |
| 9 | X-VLM (base) | 86.9 | Yes | Multi-Grained Vision Language Pre-Training: Alig... | 2021-11-16 | Code |
| 10 | Aurora (ours, r=128) | 86.8 | Yes | - | - | - |
| 11 | VSE-Gradient | 86.3 | Yes | Dissecting Deep Metric Learning Losses for Image... | 2022-10-21 | Code |
| 12 | InternVL-G | 85 | No | InternVL: Scaling up Vision Foundation Models an... | 2023-12-21 | Code |
| 13 | ALIGN | 84.9 | Yes | Scaling Up Visual and Vision-Language Representa... | 2021-02-11 | Code |
| 14 | InternVL-C | 81.7 | No | InternVL: Scaling up Vision Foundation Models an... | 2023-12-21 | Code |
| 15 | BEiT-3 | 81.5 | No | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 16 | RO-ViT | 80.7 | No | Region-Aware Pretraining for Open-Vocabulary Obj... | 2023-05-11 | Code |
| 17 | CoCa | 80.4 | No | CoCa: Contrastive Captioners are Image-Text Foun... | 2022-05-04 | Code |
| 18 | COSMOS ViT-B/16 | 80.3 | No | COSMOS: Cross-Modality Self-Distillation for Vis... | 2024-12-02 | Code |
| 19 | Flamingo | 79.5 | No | Flamingo: a Visual Language Model for Few-Shot L... | 2022-04-29 | Code |
| 20 | ERNIE-ViL 2.0 | 77.4 | No | ERNIE-ViL 2.0: Multi-view Contrastive Learning f... | 2022-09-30 | Code |
| 21 | VK-OOD | 77.2 | No | - | - | Code |
| 22 | IAIS | 76.86 | Yes | Learning Relation Alignment for Calibrated Cross... | 2021-05-28 | Code |
| 23 | ALBEF | 76.8 | No | Align before Fuse: Vision and Language Represent... | 2021-07-16 | Code |
| 24 | Florence | 76.7 | No | Florence: A New Foundation Model for Computer Vi... | 2021-11-22 | Code |
| 25 | COSMOS ViT-B/32 | 76.1 | No | COSMOS: Cross-Modality Self-Distillation for Vis... | 2024-12-02 | Code |
| 26 | ViSTA | 75.8 | Yes | ViSTA: Vision and Scene Text Aggregation for Cro... | 2022-03-31 | - |
| 27 | ALIGN | 75.7 | No | Scaling Up Visual and Vision-Language Representa... | 2021-02-11 | Code |
| 28 | PTP-BLIP (14M) | 73.1 | No | Position-guided Text Prompt for Vision-Language ... | 2022-12-19 | Code |
| 29 | AltCLIP | 72.5 | No | AltCLIP: Altering the Language Encoder in CLIP f... | 2022-11-12 | Code |
| 30 | 3SHNet | 69.5 | No | 3SHNet: Boosting Image-Sentence Retrieval via Vi... | 2024-04-26 | Code |
| 31 | CLIP | 68.7 | No | Learning Transferable Visual Models From Natural... | 2021-02-26 | Code |
| 32 | DSMD | 68.4 | No | Dynamic Self-adaptive Multiscale Distillation fr... | 2024-04-16 | Code |
| 33 | UNITER | 66.2 | No | UNITER: UNiversal Image-TExt Representation Lear... | 2019-09-25 | Code |
| 34 | ViLT-B/32 | 64.4 | Yes | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 35 | RCAR | 62.6 | No | Plug-and-Play Regulators for Image-Text Matching | 2023-03-23 | Code |
| 36 | NAPReg | 60 | No | - | - | Code |
| 37 | SGRAF | 58.5 | No | Similarity Reasoning and Filtration for Image-Te... | 2021-01-05 | Code |
| 38 | GSMN | 57.4 | No | Graph Structured Network for Image-Text Matching | 2020-04-01 | Code |
| 39 | ViLT-B/32 | 55 | No | ViLT: Vision-and-Language Transformer Without Co... | 2021-02-05 | Code |
| 40 | Pearl | 54.98 | No | - | - | - |
| 41 | ImageBERT | 54.3 | No | ImageBERT: Cross-modal Pre-training with Large-s... | 2020-01-22 | - |
| 42 | IMRAM | 53.9 | No | IMRAM: Iterative Matching with Recurrent Attenti... | 2020-03-08 | Code |
| 43 | SCAN | 48.6 | No | Stacked Cross Attention for Image-Text Matching | 2018-03-21 | Code |
| 44 | SCO (ResNet) | 41.1 | No | Learning Semantic Concepts and Order for Image a... | 2017-12-06 | - |
| 45 | VSE++ (ResNet) | 39.6 | No | VSE++: Improving Visual-Semantic Embeddings with... | 2017-07-18 | Code |
| 46 | Dual-Path (ResNet) | 39.1 | No | Dual-Path Convolutional Image-Text Embeddings wi... | 2017-11-15 | Code |
| 47 | CMPL (ResNet) | 37.3 | No | - | - | Code |