Visual Reasoning on Winoground

Metric: Image Score (higher is better)
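
For reference, the sketch below shows how an image score of this kind can be computed. It assumes the definition from the Winoground paper: each example pairs two captions with two images, and the example counts as correct only if each caption scores higher with its own image than with the other one. The `score` callable and the `Example` layout are hypothetical placeholders for whatever model and data loader are used. They are not taken from this page.

```python
from typing import Callable, Sequence, Tuple

# Assumed layout of one example: (caption_0, image_0, caption_1, image_1).
# Images are plain identifiers here; a real pipeline would pass image data.
Example = Tuple[str, str, str, str]


def image_score(
    examples: Sequence[Example],
    score: Callable[[str, str], float],
) -> float:
    """Return the percentage of examples where, for both captions,
    the matching image scores strictly higher than the other image."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    for c0, i0, c1, i1 in examples:
        picks_i0_for_c0 = score(c0, i0) > score(c0, i1)
        picks_i1_for_c1 = score(c1, i1) > score(c1, i0)
        if picks_i0_for_c0 and picks_i1_for_c1:
            correct += 1
    return 100.0 * correct / len(examples)
```

Under this definition a scorer that picks an image at random for each caption gets both choices right a quarter of the time, which matches the 25 listed for "Random chance" in the results.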

Results

#	Model	Image Score	Extra Data	Paper	Date	Code
1	GPT-4V (CoT, pick b/w two options)	68.75	No	The Role of Chain-of-Thought in Complex Vision-L...	2023-11-15	-
2	GPT-4o + CA	58.5	No	A Cognitive Paradigm Approach to Probe the Perce...	2025-01-23	-
3	OpenFlamingo + CoCoT	55.25	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
4	MMICL + CoCoT	52.5	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
5	GPT-4V + CoCoT	49.5	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
6	MMICL + CCoT	48	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
7	OpenFlamingo + DDCoT	47.25	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
8	GPT-4V (pick b/w two options)	46.25	No	The Role of Chain-of-Thought in Complex Vision-L...	2023-11-15	-
9	MMICL + DDCoT	45	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
10	MMICL (FLAN-T5-XXL)	44.99	No	MMICL: Empowering Vision-language Model with Mul...	2023-09-14	Code
11	GPT-4V	42.5	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
12	VQ2	42.2	No	What You See is What You Read? Improving Text-Im...	2023-05-17	Code
13	PaLI (ft SNLI-VE)	41.5	No	What You See is What You Read? Improving Text-Im...	2023-05-17	Code
14	OpenFlamingo	41.25	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
15	PaLI (ft SNLI-VE + Synthetic Data)	38	No	What You See is What You Read? Improving Text-Im...	2023-05-17	Code
16	GPT-4V (image-caption match answer yes/no, zero-shot)	38	No	-	-	-
17	LLaVA-1.5-CCoT	35.5	No	Compositional Chain-of-Thought Prompting for Lar...	2023-11-27	Code
18	LLaVA-1.5	33.3	No	Compositional Chain-of-Thought Prompting for Lar...	2023-11-27	Code
19	Gemini + CCoT	33	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
20	Gemini + CoCoT	32.5	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
21	FIBER (EqSim)	32	No	Equivariant Similarity for Vision-Language Found...	2023-03-25	Code
22	KeyComp* (GPT-4)	28.7	No	Prompting Large Vision-Language Models for Compo...	2024-01-20	Code
23	BLIP2 (SGVL)	28.5	No	Incorporating Structured Representations into Pr...	2023-05-10	-
24	KeyComp* (GPT-3.5)	27.8	No	Prompting Large Vision-Language Models for Compo...	2024-01-20	Code
25	OpenFlamingo + CCoT	27.5	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
26	BLIP (SGVL)	27.3	No	Incorporating Structured Representations into Pr...	2023-05-10	-
27	OFA large (TLC-A)	27	No	Simple Token-Level Confidence Improves Caption C...	2023-05-11	-
28	X-VLM 4M	26.7	No	Measuring Progress in Fine-grained Vision-and-La...	2023-05-12	Code
29	FIBER (finetuned, Flickr30k)	26.5	No	Equivariant Similarity for Vision-Language Found...	2023-03-25	Code
30	BLIP2 (ft COCO)	26	No	What You See is What You Read? Improving Text-Im...	2023-05-17	Code
31	NegBLIP2	26	No	Incorporating Structured Representations into Pr...	2023-05-10	-
32	Gemini	26	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
33	FIBER	25.75	No	Equivariant Similarity for Vision-Language Found...	2023-03-25	Code
34	BLIP (+Graph Text, +Graph Neg)	25.5	No	Incorporating Structured Representations into Pr...	2023-05-10	-
35	Gemini + DDCoT	25	No	CoCoT: Contrastive Chain-of-Thought Prompting fo...	2024-01-05	Code
36	Random chance	25	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
37	LLaVA	25	No	Incorporating Structured Representations into Pr...	2023-05-10	-
38	KeyComp (GPT-3.5)	24.6	No	Prompting Large Vision-Language Models for Compo...	2024-01-20	Code
39	X-VLM 16M	24.5	No	Measuring Progress in Fine-grained Vision-and-La...	2023-05-12	Code
40	NegBLIP	24	No	Incorporating Structured Representations into Pr...	2023-05-10	-
41	BLIP2	23.8	No	Incorporating Structured Representations into Pr...	2023-05-10	-
42	OFA base (TLC-A)	23.5	No	Simple Token-Level Confidence Improves Caption C...	2023-05-11	-
43	METER (EqSim)	22.75	No	Equivariant Similarity for Vision-Language Found...	2023-03-25	Code
44	LLaVA-1.5-ZS-CoT	22.5	No	Compositional Chain-of-Thought Prompting for Lar...	2023-11-27	Code
45	IDEFICS 80B	22.5	No	-	-	-
46	MiniGPT-4-7B (GPTScore)	21.75	No	An Examination of the Compositionality of Large ...	2023-08-21	Code
47	BLIP (VisualGPTScore, α-tuned)	21.5	No	Revisiting the Role of Language Priors in Vision...	2023-06-02	Code
48	InstructBLIP-CCoT	21.3	No	Compositional Chain-of-Thought Prompting for Lar...	2023-11-27	Code
49	IDEFICS 9B	20.8	No	-	-	-
50	METER (finetuned, Flickr30k)	20.75	No	Equivariant Similarity for Vision-Language Found...	2023-03-25	Code
51	BLIP (+Graph Text)	20.5	No	Incorporating Structured Representations into Pr...	2023-05-10	-
52	FLAVA (ITM)	20.5	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
53	IAIS large (Flickr30k)	19.75	No	-	-	-
54	IAIS large (COCO)	19.75	No	-	-	-
55	BLIP	19.2	No	Incorporating Structured Representations into Pr...	2023-05-10	-
56	BLIP 14M	18.5	No	Measuring Progress in Fine-grained Vision-and-La...	2023-05-12	Code
57	MiniGPT-4	18	No	Incorporating Structured Representations into Pr...	2023-05-10	-
58	MiniGPT-4-7B (VisualGPTScore)	18	No	An Examination of the Compositionality of Large ...	2023-08-21	Code
59	CACR base	17.75	No	-	-	-
60	VinVL	17.75	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
61	LLaVA-7B (GPTScore)	17	No	An Examination of the Compositionality of Large ...	2023-08-21	Code
62	InstructBLIP-ZS-CoT	16.3	No	Compositional Chain-of-Thought Prompting for Lar...	2023-11-27	Code
63	ALBEF 14M	16.2	No	Measuring Progress in Fine-grained Vision-and-La...	2023-05-12	Code
64	BLIP (ITM)	15.8	No	Revisiting the Role of Language Priors in Vision...	2023-06-02	Code
65	METER	15.75	No	Equivariant Similarity for Vision-Language Found...	2023-03-25	Code
66	OFA tiny (TLC-A)	15.75	No	Simple Token-Level Confidence Improves Caption C...	2023-05-11	-
67	PEVL 14M	15.7	No	Measuring Progress in Fine-grained Vision-and-La...	2023-05-12	Code
68	ALBEF 4M	15.5	No	Measuring Progress in Fine-grained Vision-and-La...	2023-05-12	Code
69	ROSITA (Flickr30k)	15.25	No	-	-	-
70	BLIP 129M (CapFilt/L)	15.2	No	Measuring Progress in Fine-grained Vision-and-La...	2023-05-12	Code
71	BLIP 129M	15	No	Measuring Progress in Fine-grained Vision-and-La...	2023-05-12	Code
72	BLIP-ViT/L 129M	14.5	No	Measuring Progress in Fine-grained Vision-and-La...	2023-05-12	Code
73	OFA large (ft SNLI-VE)	14.3	No	What You See is What You Read? Improving Text-Im...	2023-05-17	Code
74	UNITER large	14	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
75	ViLT (ViT-B/32)	14	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
76	CLIP (SGVL)	14	No	Incorporating Structured Representations into Pr...	2023-05-10	-
77	PDM-CLIP (SelfEval)	14	No	SelfEval: Leveraging the discriminative nature o...	2023-11-17	-
78	CLIP RN50x64	13.75	No	What You See is What You Read? Improving Text-Im...	2023-05-17	Code
79	LDM-T5 (SelfEval)	13.5	No	SelfEval: Leveraging the discriminative nature o...	2023-11-17	-
80	FLAVA (contrastive)	13.5	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
81	ViLLA large	13.25	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
82	UNITER base	13.25	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
83	OCLIP (ViT-H/14)	12.75	No	SelfEval: Leveraging the discriminative nature o...	2023-11-17	-
84	TIFA	12.5	No	What You See is What You Read? Improving Text-Im...	2023-05-17	Code
85	ViLLA base	12	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
86	PDM-T5 (SelfEval)	12	No	SelfEval: Leveraging the discriminative nature o...	2023-11-17	-
87	syn-CLIP	11.5	No	Going Beyond Nouns With Vision & Language Models...	2023-03-30	Code
88	COCA ViT-L14 (f.t on COCO)	11.5	No	What You See is What You Read? Improving Text-Im...	2023-05-17	Code
89	InstructBLIP	11.5	No	Compositional Chain-of-Thought Prompting for Lar...	2023-11-27	Code
90	syn-CyCLIP	10.75	No	Going Beyond Nouns With Vision & Language Models...	2023-03-30	Code
91	OFA base (ITM)	10.75	No	Simple Token-Level Confidence Improves Caption C...	2023-05-11	-
92	CLIP (ViT-B/32)	10.5	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
93	NegCLIP	10.5	No	Incorporating Structured Representations into Pr...	2023-05-10	-
94	OFA large (ITM)	10.25	No	Simple Token-Level Confidence Improves Caption C...	2023-05-11	-
95	CyCLIP	9.5	No	Going Beyond Nouns With Vision & Language Models...	2023-03-30	Code
96	BLIP (ITC)	9	No	Revisiting the Role of Language Priors in Vision...	2023-06-02	Code
97	CLIP (ViT-L/14)	8	No	SelfEval: Leveraging the discriminative nature o...	2023-11-17	-
98	VSE++ (COCO, ResNet)	8	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
99	MiniGPT-4-7B (BERTScore)	8	No	An Examination of the Compositionality of Large ...	2023-08-21	Code
100	OFA tiny (ITM)	7.75	No	Simple Token-Level Confidence Improves Caption C...	2023-05-11	-
101	ViLBERT base	7.25	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
102	LDM-CLIP (SelfEval)	7.25	No	SelfEval: Leveraging the discriminative nature o...	2023-11-17	-
103	LXMERT	7	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
104	VSRN (COCO)	7	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
105	VSE++ (Flickr30k, VGG)	6.25	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
106	UniT (ITM finetuned)	6.25	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
107	VSE++ (COCO, VGG)	5.5	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
108	LLaVA-7B (BERTScore)	5.25	No	An Examination of the Compositionality of Large ...	2023-08-21	Code
109	VSRN (Flickr30k)	5	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
110	VSE++ (Flickr30k, ResNet)	5	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code
111	VisualBERT base	2.5	No	Winoground: Probing Vision and Language Models f...	2022-04-07	Code

Papers

Paper	Date	Code
The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task	2023-11-15	-
A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs	2025-01-23	-
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs	2024-01-05	Code
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning	2023-09-14	Code
What You See is What You Read? Improving Text-Image Alignment Evaluation	2023-05-17	Code
Compositional Chain-of-Thought Prompting for Large Multimodal Models	2023-11-27	Code
Equivariant Similarity for Vision-Language Foundation Models	2023-03-25	Code
Prompting Large Vision-Language Models for Compositional Reasoning	2024-01-20	Code
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs	2023-05-10	-
Simple Token-Level Confidence Improves Caption Correctness	2023-05-11	-
Measuring Progress in Fine-grained Vision-and-Language Understanding	2023-05-12	Code
An Examination of the Compositionality of Large Generative Vision-Language Models	2023-08-21	Code
Revisiting the Role of Language Priors in Vision-Language Models	2023-06-02	Code
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality	2022-04-07	Code
SelfEval: Leveraging the discriminative nature of generative models for evaluation	2023-11-17	-
Going Beyond Nouns With Vision & Language Models Using Synthetic Data	2023-03-30	Code

Entries flagged SOTA on the leaderboard: #1 GPT-4V (CoT, pick b/w two options), #10 MMICL (FLAN-T5-XXL), #12 VQ2, #21 FIBER (EqSim), #36 Random chance.