Visual Reasoning on Winoground

Metric: Text Score (higher is better)
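
For orientation, the sketch below shows how a text score of this kind is typically computed, following the definition in the Winoground paper as I understand it. Each example pairs two captions with two images, and it counts as correct only if, for both images, the matching caption scores higher than the other caption. The `score` callable and the string image identifiers are hypothetical stand-ins for a real model and real images, not part of this leaderboard.

```python
from typing import Callable, Sequence, Tuple

# One example: (caption_0, caption_1, image_0, image_1), where caption_i matches image_i.
# Images are represented by string IDs here purely for illustration (assumption).
Example = Tuple[str, str, str, str]


def text_score(examples: Sequence[Example],
               score: Callable[[str, str], float]) -> float:
    """Return the percentage of examples where, for each image,
    the matching caption outscores the non-matching caption."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for c0, c1, i0, i1 in examples:
        picks_c0_for_i0 = score(c0, i0) > score(c1, i0)
        picks_c1_for_i1 = score(c1, i1) > score(c0, i1)
        if picks_c0_for_i0 and picks_c1_for_i1:
            correct += 1
    return 100.0 * correct / len(examples)
```

Under this definition a scorer that guesses at random wins each of the two comparisons half the time, which is consistent with the 25 reported for the "Random chance" row.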

Results

#	Model	Text Score	Extra Data	Paper	Date	Code
1	GPT-4o + CA	75.5	No	A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs	2025-01-23	-
2	GPT-4V (CoT, pick b/w two options)	75.25	No	The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task	2023-11-15	-
3	GPT-4V (pick b/w two options)	69.25	No	The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task	2023-11-15	-
4	MMICL + CoCoT	64.25	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
5	GPT-4V + CoCoT	58.5	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
6	OpenFlamingo + CoCoT	58.25	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
7	GPT-4V	54.5	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
8	FIBER (EqSim)	51.5	No	Equivariant Similarity for Vision-Language Foundation Models	2023-03-25	Code
9	FIBER (finetuned, Flickr30k)	51.25	No	Equivariant Similarity for Vision-Language Foundation Models	2023-03-25	Code
10	MMICL + CCoT	51	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
11	OpenFlamingo + DDCoT	47.5	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
12	VQ2	47	No	What You See is What You Read? Improving Text-Image Alignment Evaluation	2023-05-17	Code
13	MMICL + DDCoT	46.75	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
14	X-VLM 16M	46.7	No	Measuring Progress in Fine-grained Vision-and-Language Understanding	2023-05-12	Code
15	PaLI (ft SNLI-VE + Synthetic Data)	46.5	No	What You See is What You Read? Improving Text-Image Alignment Evaluation	2023-05-17	Code
16	FIBER	46.25	No	Equivariant Similarity for Vision-Language Foundation Models	2023-03-25	Code
17	MMICL (FLAN-T5-XXL)	45.5	No	MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning	2023-09-14	Code
18	PaLI (ft SNLI-VE)	45	No	What You See is What You Read? Improving Text-Image Alignment Evaluation	2023-05-17	Code
19	Gemini + DDCoT	45	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
20	METER (EqSim)	45	No	Equivariant Similarity for Vision-Language Foundation Models	2023-03-25	Code
21	X-VLM 4M	44	No	Measuring Progress in Fine-grained Vision-and-Language Understanding	2023-05-12	Code
22	BLIP2 (ft COCO)	44	No	What You See is What You Read? Improving Text-Image Alignment Evaluation	2023-05-17	Code
23	KeyComp* (GPT-4)	43.5	No	Prompting Large Vision-Language Models for Compositional Reasoning	2024-01-20	Code
24	METER (finetuned, Flickr30k)	43.5	No	Equivariant Similarity for Vision-Language Foundation Models	2023-03-25	Code
25	BLIP2 (SGVL)	42.8	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
26	BLIP (SGVL)	42.8	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
27	KeyComp* (GPT-3.5)	42.7	No	Prompting Large Vision-Language Models for Compositional Reasoning	2024-01-20	Code
28	OpenFlamingo + CCoT	42.5	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
29	NegBLIP	42.5	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
30	IAIS large (Flickr30k)	42.5	No	-	-	-
31	LLaVA-1.5-CCoT	42	No	Compositional Chain-of-Thought Prompting for Large Multimodal Models	2023-11-27	Code
32	BLIP2	42	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
33	IAIS large (COCO)	41.75	No	-	-	-
34	NegBLIP2	41.5	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
35	BLIP (+Graph Text, +Graph Neg)	40.5	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
36	BLIP (+Graph Text)	40.3	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
37	Gemini + CoCoT	40	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
38	CACR base	39.25	No	-	-	-
39	METER	39.25	No	Equivariant Similarity for Vision-Language Foundation Models	2023-03-25	Code
40	OpenFlamingo	39	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
41	BLIP	39	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
42	GPT-4V (image-caption match answer yes/no, zero-shot)	38	No	-	-	-
43	UNITER large	38	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
44	VinVL	37.75	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
45	ViLLA large	37	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
46	BLIP (VisualGPTScore, α-tuned)	36.5	No	Revisiting the Role of Language Priors in Vision-Language Models	2023-06-02	Code
47	BLIP 14M	36.5	No	Measuring Progress in Fine-grained Vision-and-Language Understanding	2023-05-12	Code
48	ViT-B/16 + BERT base + ViLEM	36.5	No	-	-	-
49	LLaVA-1.5	36	No	Compositional Chain-of-Thought Prompting for Large Multimodal Models	2023-11-27	Code
50	BLIP (ITM)	35.8	No	Revisiting the Role of Language Priors in Vision-Language Models	2023-06-02	Code
51	BLIP 129M	35.5	No	Measuring Progress in Fine-grained Vision-and-Language Understanding	2023-05-12	Code
52	ROSITA (Flickr30k)	35.25	No	-	-	-
53	ViLT (ViT-B/32)	34.75	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
54	BLIP 129M (CapFilt/L)	34.7	No	Measuring Progress in Fine-grained Vision-and-Language Understanding	2023-05-12	Code
55	BLIP-ViT/L 129M	34.7	No	Measuring Progress in Fine-grained Vision-and-Language Understanding	2023-05-12	Code
56	Diffusion Classifier (zero-shot)	34	No	Your Diffusion Model is Secretly a Zero-Shot Classifier	2023-03-28	Code
57	PEVL 14M	33.2	No	Measuring Progress in Fine-grained Vision-and-Language Understanding	2023-05-12	Code
58	ALBEF 14M	32.5	No	Measuring Progress in Fine-grained Vision-and-Language Understanding	2023-05-12	Code
59	FLAVA (ITM)	32.25	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
60	UNITER base	32.25	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
61	CLIP (SGVL)	32	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
62	ViT-B/16 + BERT base	31.2	No	-	-	-
63	Gemini	30.75	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
64	OCLIP (ViT-H/14)	30.75	No	SelfEval: Leveraging the discriminative nature of generative models for evaluation	2023-11-17	-
65	CLIP (ViT-B/32)	30.75	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
66	OFA large (ITM)	30.75	No	Simple Token-Level Confidence Improves Caption Correctness	2023-05-11	-
67	KeyComp (GPT-3.5)	30.3	No	Prompting Large Vision-Language Models for Compositional Reasoning	2024-01-20	Code
68	CLIP (ViT-L/14)	30.25	No	SelfEval: Leveraging the discriminative nature of generative models for evaluation	2023-11-17	-
69	ViLLA base	30	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
70	syn-CLIP	30	No	Going Beyond Nouns With Vision & Language Models Using Synthetic Data	2023-03-30	Code
71	syn-CyCLIP	30	No	Going Beyond Nouns With Vision & Language Models Using Synthetic Data	2023-03-30	Code
72	NegCLIP	29.5	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
73	OFA large (TLC-A)	29.25	No	Simple Token-Level Confidence Improves Caption Correctness	2023-05-11	-
74	ALBEF 4M	29.2	No	Measuring Progress in Fine-grained Vision-and-Language Understanding	2023-05-12	Code
75	LDM-T5 (SelfEval)	29	No	SelfEval: Leveraging the discriminative nature of generative models for evaluation	2023-11-17	-
76	CyCLIP	28.5	No	Going Beyond Nouns With Vision & Language Models Using Synthetic Data	2023-03-30	Code
77	PDM-T5 (SelfEval)	28.25	No	SelfEval: Leveraging the discriminative nature of generative models for evaluation	2023-11-17	-
78	COCA ViT-L14 (f.t on COCO)	28.25	No	What You See is What You Read? Improving Text-Image Alignment Evaluation	2023-05-17	Code
79	LLaVA-1.5-ZS-CoT	28	No	Compositional Chain-of-Thought Prompting for Large Multimodal Models	2023-11-27	Code
80	BLIP (ITC)	28	No	Revisiting the Role of Language Priors in Vision-Language Models	2023-06-02	Code
81	OFA large (ft SNLI-VE)	27.7	No	What You See is What You Read? Improving Text-Image Alignment Evaluation	2023-05-17	Code
82	OFA base (ITM)	26.75	No	Simple Token-Level Confidence Improves Caption Correctness	2023-05-11	-
83	CLIP RN50x64	26.5	No	What You See is What You Read? Improving Text-Image Alignment Evaluation	2023-05-17	Code
84	LLaVA-7B (GPTScore)	25.5	No	An Examination of the Compositionality of Large Generative Vision-Language Models	2023-08-21	Code
85	FLAVA (contrastive)	25.25	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
86	Random chance	25	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
87	LLaVA	24.8	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
88	OFA base (TLC-A)	24.5	No	Simple Token-Level Confidence Improves Caption Correctness	2023-05-11	-
89	MiniGPT-4-7B (GPTScore)	24.5	No	An Examination of the Compositionality of Large Generative Vision-Language Models	2023-08-21	Code
90	ViLBERT base	23.75	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
91	MiniGPT-4	23.3	No	Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
92	MiniGPT-4-7B (VisualGPTScore)	23.25	No	An Examination of the Compositionality of Large Generative Vision-Language Models	2023-08-21	Code
93	VSE++ (COCO, ResNet)	22.75	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
94	OFA tiny (ITM)	22.75	No	Simple Token-Level Confidence Improves Caption Correctness	2023-05-11	-
95	LDM-CLIP (SelfEval)	22.75	No	SelfEval: Leveraging the discriminative nature of generative models for evaluation	2023-11-17	-
96	Gemini + CCoT	22.5	No	CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
97	InstructBLIP-CCoT	21	No	Compositional Chain-of-Thought Prompting for Large Multimodal Models	2023-11-27	Code
98	VSRN (Flickr30k)	20	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
99	VSE++ (Flickr30k, ResNet)	20	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
100	VSE++ (Flickr30k, VGG)	19.75	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
101	UniT (ITM finetuned)	19.5	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
102	LXMERT	19.25	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
103	TIFA	19	No	What You See is What You Read? Improving Text-Image Alignment Evaluation	2023-05-17	Code
104	IDEFICS 80B	18.75	No	-	-	-
105	VSE++ (COCO, VGG)	18.75	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
106	VSRN (COCO)	17.5	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
107	PDM-CLIP (SelfEval)	17	No	SelfEval: Leveraging the discriminative nature of generative models for evaluation	2023-11-17	-
108	IDEFICS 9B	16.8	No	-	-	-
109	OFA tiny (TLC-A)	16.5	No	Simple Token-Level Confidence Improves Caption Correctness	2023-05-11	-
110	VisualBERT base	15.5	No	Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
111	MiniGPT-4-7B (BERTScore)	14	No	An Examination of the Compositionality of Large Generative Vision-Language Models	2023-08-21	Code
112	LLaVA-7B (BERTScore)	13.5	No	An Examination of the Compositionality of Large Generative Vision-Language Models	2023-08-21	Code
113	InstructBLIP-ZS-CoT	9.3	No	Compositional Chain-of-Thought Prompting for Large Multimodal Models	2023-11-27	Code
114	InstructBLIP	7	No	Compositional Chain-of-Thought Prompting for Large Multimodal Models	2023-11-27	Code
