Visual Reasoning on Winoground
Metric: Group Score (higher is better)
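As a rough illustration of the metric, the sketch below follows the group-score definition from the Winoground paper as commonly stated: each example pairs two captions with two images, and the example counts as correct only when the model ranks the right caption for each image (text score) and the right image for each caption (image score). The function names and the `(caption, image)` dictionary layout are assumptions made for this example, not part of the leaderboard. Entries that prompt a model to pick between options may be scored with a different protocol.

```python
from itertools import permutations

# Each example is a dict mapping (caption_index, image_index) -> similarity score.
# Caption 0 matches image 0, and caption 1 matches image 1.

def text_correct(s):
    # For each image, the matching caption must outscore the other caption.
    return s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]

def image_correct(s):
    # For each caption, the matching image must outscore the other image.
    return s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]

def group_correct(s):
    # The group criterion requires both the text and the image criteria.
    return text_correct(s) and image_correct(s)

def group_score(examples):
    # Percentage of examples that satisfy the group criterion.
    return 100.0 * sum(group_correct(s) for s in examples) / len(examples)

def chance_group_score():
    # Enumerate every strict ordering of the four scores; a model that
    # scores at random is equally likely to produce any of them.
    keys = [(0, 0), (0, 1), (1, 0), (1, 1)]
    orderings = [dict(zip(keys, p)) for p in permutations(range(4))]
    return group_score(orderings)
```

Under these assumptions, 4 of the 24 possible orderings satisfy the group criterion, so `chance_group_score()` comes out at about 16.67, which matches the "Random chance" row in the results.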
Results
| # | Model | Group Score | Extra Data | Paper | Date | Code | SOTA |
|---|---|---|---|---|---|---|---|
| 1 | GPT-4V (CoT, pick b/w two options) | 58.75 | No | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | 2023-11-15 | - | SOTA |
| 2 | GPT-4o + CA | 52 | No | A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs | 2025-01-23 | - | - |
| 3 | MMICL + CoCoT | 50.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 4 | MMICL + CCoT | 47.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 5 | GPT-4V + CoCoT | 44.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 6 | MMICL (FLAN-T5-XXL) | 43 | No | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | 2023-09-14 | Code | SOTA |
| 7 | OpenFlamingo + CoCoT | 41.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 8 | GPT-4V (pick b/w two options) | 39.25 | No | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | 2023-11-15 | - | - |
| 9 | OpenFlamingo + DDCoT | 39 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 10 | GPT-4V (image-caption match answer yes/no, zero-shot) | 38 | No | - | - | - | - |
| 11 | GPT-4V | 37.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 12 | MMICL + DDCoT | 36.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 13 | OpenFlamingo | 33.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 14 | VQ2 | 30.5 | No | What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023-05-17 | Code | SOTA |
| 15 | PaLI (ft SNLI-VE + Synthetic Data) | 28.75 | No | What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023-05-17 | Code | - |
| 16 | PaLI (ft SNLI-VE) | 28.7 | No | What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023-05-17 | Code | - |
| 17 | Gemini + CoCoT | 27.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 18 | FIBER (EqSim) | 27.5 | No | Equivariant Similarity for Vision-Language Foundation Models | 2023-03-25 | Code | SOTA |
| 19 | Gemini | 25 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 20 | Gemini + DDCoT | 23.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 21 | BLIP2 (ft COCO) | 23.5 | No | What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023-05-17 | Code | - |
| 22 | BLIP2 (SGVL) | 23.3 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 23 | FIBER (finetuned, Flickr30k) | 23 | No | Equivariant Similarity for Vision-Language Foundation Models | 2023-03-25 | Code | - |
| 24 | LLaVA-1.5-CCoT | 22.3 | No | Compositional Chain-of-Thought Prompting for Large Multimodal Models | 2023-11-27 | Code | - |
| 25 | FIBER | 22.25 | No | Equivariant Similarity for Vision-Language Foundation Models | 2023-03-25 | Code | - |
| 26 | X-VLM 4M | 21.5 | No | Measuring Progress in Fine-grained Vision-and-Language Understanding | 2023-05-12 | Code | - |
| 27 | BLIP (SGVL) | 21.5 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 28 | X-VLM 16M | 21.2 | No | Measuring Progress in Fine-grained Vision-and-Language Understanding | 2023-05-12 | Code | - |
| 29 | Gemini + CCoT | 20.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 30 | NegBLIP2 | 20.5 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 31 | LLaVA-1.5 | 20.1 | No | Compositional Chain-of-Thought Prompting for Large Multimodal Models | 2023-11-27 | Code | - |
| 32 | OpenFlamingo + CCoT | 20 | No | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | 2024-01-05 | Code | - |
| 33 | BLIP2 | 19 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 34 | BLIP (+Graph Text, +Graph Neg) | 19 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 35 | METER (EqSim) | 18.75 | No | Equivariant Similarity for Vision-Language Foundation Models | 2023-03-25 | Code | - |
| 36 | NegBLIP | 18.5 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 37 | KeyComp* (GPT-4) | 18.2 | No | Prompting Large Vision-Language Models for Compositional Reasoning | 2024-01-20 | Code | - |
| 38 | OFA large (TLC-A) | 17.5 | No | Simple Token-Level Confidence Improves Caption Correctness | 2023-05-11 | - | - |
| 39 | KeyComp* (GPT-3.5) | 17.4 | No | Prompting Large Vision-Language Models for Compositional Reasoning | 2024-01-20 | Code | - |
| 40 | BLIP (VisualGPTScore, α-tuned) | 16.8 | No | Revisiting the Role of Language Priors in Vision-Language Models | 2023-06-02 | Code | - |
| 41 | Random chance | 16.67 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | SOTA |
| 42 | BLIP (+Graph Text) | 16.5 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 43 | IAIS large (Flickr30k) | 16 | No | - | - | - | - |
| 44 | IAIS large (COCO) | 15.5 | No | - | - | - | - |
| 45 | BLIP | 15 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 46 | METER (finetuned, Flickr30k) | 14.75 | No | Equivariant Similarity for Vision-Language Foundation Models | 2023-03-25 | Code | - |
| 47 | VinVL | 14.5 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 48 | BLIP 14M | 14.5 | No | Measuring Progress in Fine-grained Vision-and-Language Understanding | 2023-05-12 | Code | - |
| 49 | CACR base | 14.25 | No | - | - | - | - |
| 50 | FLAVA (ITM) | 14.25 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 51 | OFA base (TLC-A) | 13.75 | No | Simple Token-Level Confidence Improves Caption Correctness | 2023-05-11 | - | - |
| 52 | BLIP (ITM) | 13.3 | No | Revisiting the Role of Language Priors in Vision-Language Models | 2023-06-02 | Code | - |
| 53 | LLaVA | 13 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 54 | ALBEF 14M | 12.7 | No | Measuring Progress in Fine-grained Vision-and-Language Understanding | 2023-05-12 | Code | - |
| 55 | KeyComp (GPT-3.5) | 12.4 | No | Prompting Large Vision-Language Models for Compositional Reasoning | 2024-01-20 | Code | - |
| 56 | LLaVA-1.5-ZS-CoT | 12.3 | No | Compositional Chain-of-Thought Prompting for Large Multimodal Models | 2023-11-27 | Code | - |
| 57 | ROSITA (Flickr30k) | 12.25 | No | - | - | - | - |
| 58 | BLIP 129M (CapFilt/L) | 12.2 | No | Measuring Progress in Fine-grained Vision-and-Language Understanding | 2023-05-12 | Code | - |
| 59 | BLIP-ViT/L 129M | 12.2 | No | Measuring Progress in Fine-grained Vision-and-Language Understanding | 2023-05-12 | Code | - |
| 60 | PEVL 14M | 12.2 | No | Measuring Progress in Fine-grained Vision-and-Language Understanding | 2023-05-12 | Code | - |
| 61 | METER | 12 | No | Equivariant Similarity for Vision-Language Foundation Models | 2023-03-25 | Code | - |
| 62 | BLIP 129M | 11.7 | No | Measuring Progress in Fine-grained Vision-and-Language Understanding | 2023-05-12 | Code | - |
| 63 | MiniGPT-4-7B (GPTScore) | 11.5 | No | An Examination of the Compositionality of Large Generative Vision-Language Models | 2023-08-21 | Code | - |
| 64 | TIFA | 11.3 | No | What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023-05-17 | Code | - |
| 65 | ViLLA large | 11 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 66 | ALBEF 4M | 11 | No | Measuring Progress in Fine-grained Vision-and-Language Understanding | 2023-05-12 | Code | - |
| 67 | UNITER large | 10.5 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 68 | LLaVA-7B (GPTScore) | 10.5 | No | An Examination of the Compositionality of Large Generative Vision-Language Models | 2023-08-21 | Code | - |
| 69 | CLIP RN50x64 | 10.25 | No | What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023-05-17 | Code | - |
| 70 | UNITER base | 10 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 71 | CLIP (SGVL) | 9.8 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 72 | syn-CLIP | 9.5 | No | Going Beyond Nouns With Vision & Language Models Using Synthetic Data | 2023-03-30 | Code | - |
| 73 | MiniGPT-4 | 9.5 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 74 | MiniGPT-4-7B (VisualGPTScore) | 9.5 | No | An Examination of the Compositionality of Large Generative Vision-Language Models | 2023-08-21 | Code | - |
| 75 | ViLT (ViT-B/32) | 9.25 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 76 | OFA large (ft SNLI-VE) | 9 | No | What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023-05-17 | Code | - |
| 77 | FLAVA (contrastive) | 9 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 78 | InstructBLIP-CCoT | 8.3 | No | Compositional Chain-of-Thought Prompting for Large Multimodal Models | 2023-11-27 | Code | - |
| 79 | syn-CyCLIP | 8.25 | No | Going Beyond Nouns With Vision & Language Models Using Synthetic Data | 2023-03-30 | Code | - |
| 80 | COCA ViT-L14 (f.t on COCO) | 8.25 | No | What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023-05-17 | Code | - |
| 81 | CLIP (ViT-B/32) | 8 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 82 | ViLLA base | 8 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 83 | NegCLIP | 8 | No | Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs | 2023-05-10 | - | - |
| 84 | IDEFICS 80B | 8 | No | - | - | - | - |
| 85 | OFA large (ITM) | 7.25 | No | Simple Token-Level Confidence Improves Caption Correctness | 2023-05-11 | - | - |
| 86 | CyCLIP | 7.25 | No | Going Beyond Nouns With Vision & Language Models Using Synthetic Data | 2023-03-30 | Code | - |
| 87 | OFA tiny (TLC-A) | 6.75 | No | Simple Token-Level Confidence Improves Caption Correctness | 2023-05-11 | - | - |
| 88 | BLIP (ITC) | 6.5 | No | Revisiting the Role of Language Priors in Vision-Language Models | 2023-06-02 | Code | - |
| 89 | OFA base (ITM) | 6.5 | No | Simple Token-Level Confidence Improves Caption Correctness | 2023-05-11 | - | - |
| 90 | IDEFICS 9B | 5 | No | - | - | - | - |
| 91 | ViLBERT base | 4.75 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 92 | OFA tiny (ITM) | 4.5 | No | Simple Token-Level Confidence Improves Caption Correctness | 2023-05-11 | - | - |
| 93 | VSE++ (Flickr30k, VGG) | 4.5 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 94 | VSE++ (COCO, ResNet) | 4 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 95 | UniT (ITM finetuned) | 4 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 96 | LXMERT | 4 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 97 | InstructBLIP-ZS-CoT | 4 | No | Compositional Chain-of-Thought Prompting for Large Multimodal Models | 2023-11-27 | Code | - |
| 98 | VSRN (COCO) | 3.75 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 99 | VSRN (Flickr30k) | 3.5 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 100 | VSE++ (COCO, VGG) | 3.5 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 101 | InstructBLIP | 3.3 | No | Compositional Chain-of-Thought Prompting for Large Multimodal Models | 2023-11-27 | Code | - |
| 102 | VSE++ (Flickr30k, ResNet) | 2.75 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |
| 103 | MiniGPT-4-7B (BERTScore) | 2.75 | No | An Examination of the Compositionality of Large Generative Vision-Language Models | 2023-08-21 | Code | - |
| 104 | LLaVA-7B (BERTScore) | 2.25 | No | An Examination of the Compositionality of Large Generative Vision-Language Models | 2023-08-21 | Code | - |
| 105 | VisualBERT base | 1.5 | No | Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality | 2022-04-07 | Code | - |