Image-to-Text Retrieval on Flickr30k

Metric: Recall@5 (higher is better)
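
For readers unfamiliar with the metric, here is a minimal sketch of how Recall@K is commonly computed for image-to-text retrieval: each query image counts as a hit if at least one of its ground-truth captions appears among the K highest-scoring captions, and the metric is the percentage of hits. The function name `recall_at_k`, the similarity matrix layout, and the toy scores are assumptions made for illustration; the exact evaluation protocol of each listed paper may differ.

```python
def recall_at_k(similarity, relevant, k=5):
    """Percentage of query images with at least one correct caption in the top k.

    similarity[i][j] is a score between image i and caption j (higher = closer);
    relevant[i] is the set of caption indices that describe image i.
    """
    hits = 0
    for scores, gold in zip(similarity, relevant):
        # Rank caption indices by descending score and keep the top k.
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        if gold.intersection(top_k):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example with hypothetical scores: 2 images, 4 captions.
similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.7, 0.2, 0.1],
]
relevant = [{0, 1}, {2, 3}]
r1 = recall_at_k(similarity, relevant, k=1)
r3 = recall_at_k(similarity, relevant, k=3)
```

In the toy data only the first image retrieves a correct caption at rank 1, while both images have a correct caption within the top 3, so the score rises as K grows.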

Results

#	Model	Recall@5	Extra Data	SOTA badge	Paper	Date	Code
1	InternVL-G-FT (finetuned, w/o ranking)	100	No	No	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks	2023-12-21	Code
2	BLIP-2 ViT-G (zero-shot, 1K test set)	100	No	Yes	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	2023-01-30	Code
3	ONE-PEACE (finetuned, w/o ranking)	100	No	No	ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities	2023-05-18	Code
4	InternVL-C-FT (finetuned, w/o ranking)	100	No	No	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks	2023-12-21	Code
5	BLIP-2 ViT-L (zero-shot, 1K test set)	100	No	No	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	2023-01-30	Code
6	ERNIE-ViL 2.0	99.9	No	Yes	ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training	2022-09-30	Code
7	ALBEF	99.8	No	Yes	Align before Fuse: Vision and Language Representation Learning with Momentum Distillation	2021-07-16	Code
8	ALBEF	99.3	No	No	HADA: A Graph-based Amalgamation Framework in Image-text Retrieval	2023-01-11	Code
9	UNITER	98	No	No	HADA: A Graph-based Amalgamation Framework in Image-text Retrieval	2023-01-11	Code
10	GSMN	94.3	No	Yes	A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval	2021-06-04	Code
11	LGSGM	91.9	No	No	A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval	2021-06-04	Code
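
The SOTA markers on four of the entries appear to flag results that set a new best Recall@5 at the time they were published. That reading is an inference from the listed dates and scores, not something the page states. Under that assumption, the following sketch reproduces the markers from the results above; the function name `sota_at_release` is hypothetical, and breaking ties between entries with the same date and score by listing order is a further assumption.

```python
# (model, Recall@5, publication date) copied from the results, in listed order.
rows = [
    ("InternVL-G-FT", 100.0, "2023-12-21"),
    ("BLIP-2 ViT-G", 100.0, "2023-01-30"),
    ("ONE-PEACE", 100.0, "2023-05-18"),
    ("InternVL-C-FT", 100.0, "2023-12-21"),
    ("BLIP-2 ViT-L", 100.0, "2023-01-30"),
    ("ERNIE-ViL 2.0", 99.9, "2022-09-30"),
    ("ALBEF", 99.8, "2021-07-16"),
    ("ALBEF", 99.3, "2023-01-11"),
    ("UNITER", 98.0, "2023-01-11"),
    ("GSMN", 94.3, "2021-06-04"),
    ("LGSGM", 91.9, "2021-06-04"),
]


def sota_at_release(rows):
    """Return (model, date) pairs that strictly beat every earlier score."""
    # ISO dates sort correctly as strings; within one date, higher scores go first.
    # sorted() is stable, so exact ties keep their listed order (an assumption).
    ordered = sorted(rows, key=lambda r: (r[2], -r[1]))
    best = float("-inf")
    flagged = []
    for model, score, date in ordered:
        if score > best:
            flagged.append((model, date))
            best = score
    return flagged


flagged = sota_at_release(rows)
```

With the data above, the flagged entries are GSMN, the 2021 ALBEF result, ERNIE-ViL 2.0, and BLIP-2 ViT-G, which matches the four entries carrying the marker.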