Image-to-Text Retrieval on Flickr30k
Metric: Recall@5 (higher is better)
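For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@K is commonly computed for image-to-text retrieval: each image is a query, all captions are ranked by similarity, and the query counts as a hit if at least one of its own captions appears in the top K. The function name `recall_at_k`, its arguments, and the toy data are assumptions made for illustration; this is not the evaluation code of any paper listed here, and individual papers may differ in details such as test split or re-ranking.

```python
import numpy as np


def recall_at_k(similarity, caption_to_image, k=5):
    """Image-to-text Recall@K, returned as a percentage.

    similarity: (num_images, num_captions) array, higher means a better match.
    caption_to_image: (num_captions,) array; entry j is the index of the
        image that caption j describes (hypothetical input layout).
    """
    similarity = np.asarray(similarity, dtype=float)
    caption_to_image = np.asarray(caption_to_image)
    num_images = similarity.shape[0]

    # Indices of the k highest-scoring captions for each image query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]

    # A query is a hit if any retrieved caption belongs to that image.
    owners = caption_to_image[top_k]
    hits = (owners == np.arange(num_images)[:, None]).any(axis=1)
    return 100.0 * hits.mean()


# Toy data: 3 images with 2 captions each (illustrative values only).
toy_caption_to_image = np.array([0, 0, 1, 1, 2, 2])
toy_similarity = np.array([
    [0.9, 0.1, 0.8, 0.2, 0.3, 0.0],  # image 0: top 2 includes its caption 0
    [0.7, 0.6, 0.1, 0.5, 0.2, 0.0],  # image 1: top 2 are both image 0's captions
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.9],  # image 2: top 2 are its own captions
])

print(recall_at_k(toy_similarity, toy_caption_to_image, k=2))
print(recall_at_k(toy_similarity, toy_caption_to_image, k=3))
```

On the toy data, two of three images are hits at K=2 and all three at K=3, which shows why scores saturate as K grows and why several entries on this board reach 100.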
Results
#   Model                                   Recall@5  Extra Data  Date        SOTA  Code  Paper
1   InternVL-G-FT (finetuned, w/o ranking)  100       No          2023-12-21  -     Yes   InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
2   BLIP-2 ViT-G (zero-shot, 1K test set)   100       No          2023-01-30  SOTA  Yes   BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
3   ONE-PEACE (finetuned, w/o ranking)      100       No          2023-05-18  -     Yes   ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
4   InternVL-C-FT (finetuned, w/o ranking)  100       No          2023-12-21  -     Yes   InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
5   BLIP-2 ViT-L (zero-shot, 1K test set)   100       No          2023-01-30  -     Yes   BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
6   ERNIE-ViL 2.0                           99.9      No          2022-09-30  SOTA  Yes   ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
7   ALBEF                                   99.8      No          2021-07-16  SOTA  Yes   Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
8   ALBEF                                   99.3      No          2023-01-11  -     Yes   HADA: A Graph-based Amalgamation Framework in Image-text Retrieval
9   UNITER                                  98        No          2023-01-11  -     Yes   HADA: A Graph-based Amalgamation Framework in Image-text Retrieval
10  GSMN                                    94.3      No          2021-06-04  SOTA  Yes   A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval
11  LGSGM                                   91.9      No          2021-06-04  -     Yes   A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval