Image-to-Text Retrieval on COCO (Common Objects in Context)

Metric: Recall@10 (higher is better)
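
The page does not define the metric. Under the usual convention for image-to-text retrieval, Recall@K is the percentage of query images for which at least one ground-truth caption appears among the K highest-scoring captions. The sketch below illustrates that convention only. The function name `recall_at_k`, the similarity matrix `scores`, and the ground-truth sets `relevant` are hypothetical names chosen for the example, and the listed papers may use different evaluation splits or protocols.

```python
from typing import List, Sequence, Set


def recall_at_k(scores: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int = 10) -> float:
    """Percentage of image queries with at least one correct caption in the top k.

    scores[i][j] is an assumed similarity between image i and caption j.
    relevant[i] is the set of caption indices that describe image i.
    """
    hits = 0
    for row, gold in zip(scores, relevant):
        # Rank caption indices from highest to lowest similarity.
        ranked: List[int] = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        # The query counts as a hit if any ground-truth caption is in the top k.
        if gold & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy data (made up for illustration): 3 images, 3 candidate captions.
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.7],
    [0.6, 0.5, 0.1],
]
toy_relevant = [{0}, {2}, {2}]

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(toy_scores, toy_relevant, k):.1f}")
```

With K fixed at 10, a higher percentage means the correct caption is found within the first ten results more often, which is why the leaderboard treats higher as better.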

Results

#	Model	Recall@10	Extra Data	Paper	Date	Code
1	Oscar	99.8	No	Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks	2020-04-13	Code
2	BLIP-2 (ViT-G, fine-tuned)	98.5	No	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	2023-01-30	Code
3	ONE-PEACE (ViT-G, w/o ranking)	98.3	No	ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities	2023-05-18	Code
4	BLIP-2 (ViT-L, fine-tuned)	98	No	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	2023-01-30	Code
5	Unicoder-VL	97.2	No	Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training	2019-08-16	-
6	IAIS	94.48	No	Learning Relation Alignment for Calibrated Cross-modal Retrieval	2021-05-28	Code
7	CLIP (zero-shot)	88.1	No	Learning Transferable Visual Models From Natural Language Supervision	2021-02-26	Code
8	DVSA	74.8	No	Deep Visual-Semantic Alignments for Generating Image Descriptions	2014-12-07	Code

Entries marked SOTA: Oscar, Unicoder-VL, DVSA.