Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Visual Reasoning on Winoground

Metric: Image Score (higher is better)
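
For readers unfamiliar with the metric, here is a minimal sketch of how an image score is typically computed on Winoground-style data. It assumes each example has two captions and two images, and that a model yields a similarity score for each of the four caption/image pairings. The names `Scores`, `image_correct`, and `image_score` and the sample numbers are hypothetical and are not taken from any paper listed on this page.

```python
from typing import Sequence, Tuple

# One example: similarity scores for the four caption/image pairings,
# ordered as (s(c0, i0), s(c0, i1), s(c1, i0), s(c1, i1)).
# Caption c0 belongs with image i0, caption c1 with image i1.
Scores = Tuple[float, float, float, float]


def image_correct(scores: Scores) -> bool:
    """True when each caption scores its own image strictly above the other image."""
    c0_i0, c0_i1, c1_i0, c1_i1 = scores
    return c0_i0 > c0_i1 and c1_i1 > c1_i0


def image_score(examples: Sequence[Scores]) -> float:
    """Percentage of examples where the right image is chosen for both captions."""
    if not examples:
        raise ValueError("need at least one example")
    return 100.0 * sum(image_correct(s) for s in examples) / len(examples)


# Hypothetical similarity scores, for illustration only.
examples = [
    (0.9, 0.2, 0.1, 0.8),  # both captions prefer their own image: correct
    (0.9, 0.2, 0.7, 0.6),  # caption c1 prefers image i0: incorrect
    (0.3, 0.5, 0.4, 0.9),  # caption c0 prefers image i1: incorrect
    (0.6, 0.1, 0.2, 0.7),  # correct
]
print(image_score(examples))  # 50.0
```

Under this definition a model that scores pairings at random gets each of the two comparisons right half the time, so it is correct on about a quarter of the examples, which is consistent with the "Random chance" entry of 25 in the results.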

Results

| # | Model | Image Score | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | GPT-4V (CoT, pick b/w two options) | 68.75 | No | The Role of Chain-of-Thought in Complex Vision-L... | 2023-11-15 | - |
| 2 | GPT-4o + CA | 58.5 | No | A Cognitive Paradigm Approach to Probe the Perce... | 2025-01-23 | - |
| 3 | OpenFlamingo + CoCoT | 55.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 4 | MMICL + CoCoT | 52.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 5 | GPT-4V + CoCoT | 49.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 6 | MMICL + CCoT | 48 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 7 | OpenFlamingo + DDCoT | 47.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 8 | GPT-4V (pick b/w two options) | 46.25 | No | The Role of Chain-of-Thought in Complex Vision-L... | 2023-11-15 | - |
| 9 | MMICL + DDCoT | 45 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 10 | MMICL (FLAN-T5-XXL) | 44.99 | No | MMICL: Empowering Vision-language Model with Mul... | 2023-09-14 | Code |
| 11 | GPT-4V | 42.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 12 | VQ2 | 42.2 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 13 | PaLI (ft SNLI-VE) | 41.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 14 | OpenFlamingo | 41.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 15 | PaLI (ft SNLI-VE + Synthetic Data) | 38 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 16 | GPT-4V (image-caption match answer yes/no, zero-shot) | 38 | No | - | - | - |
| 17 | LLaVA-1.5-CCoT | 35.5 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 18 | LLaVA-1.5 | 33.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 19 | Gemini + CCoT | 33 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 20 | Gemini + CoCoT | 32.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 21 | FIBER (EqSim) | 32 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 22 | KeyComp* (GPT-4) | 28.7 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 23 | BLIP2 (SGVL) | 28.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 24 | KeyComp* (GPT-3.5) | 27.8 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 25 | OpenFlamingo + CCoT | 27.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 26 | BLIP (SGVL) | 27.3 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 27 | OFA large (TLC-A) | 27 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 28 | X-VLM 4M | 26.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 29 | FIBER (finetuned, Flickr30k) | 26.5 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 30 | BLIP2 (ft COCO) | 26 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 31 | NegBLIP2 | 26 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 32 | Gemini | 26 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 33 | FIBER | 25.75 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 34 | BLIP (+Graph Text, +Graph Neg) | 25.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 35 | Gemini + DDCoT | 25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 36 | Random chance | 25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 37 | LLaVA | 25 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 38 | KeyComp (GPT-3.5) | 24.6 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 39 | X-VLM 16M | 24.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 40 | NegBLIP | 24 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 41 | BLIP2 | 23.8 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 42 | OFA base (TLC-A) | 23.5 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 43 | METER (EqSim) | 22.75 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 44 | LLaVA-1.5-ZS-CoT | 22.5 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 45 | IDEFICS 80B | 22.5 | No | - | - | - |
| 46 | MiniGPT-4-7B (GPTScore) | 21.75 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 47 | BLIP (VisualGPTScore, α-tuned) | 21.5 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 48 | InstructBLIP-CCoT | 21.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 49 | IDEFICS 9B | 20.8 | No | - | - | - |
| 50 | METER (finetuned, Flickr30k) | 20.75 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 51 | BLIP (+Graph Text) | 20.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 52 | FLAVA (ITM) | 20.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 53 | IAIS large (Flickr30k) | 19.75 | No | - | - | - |
| 54 | IAIS large (COCO) | 19.75 | No | - | - | - |
| 55 | BLIP | 19.2 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 56 | BLIP 14M | 18.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 57 | MiniGPT-4 | 18 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 58 | MiniGPT-4-7B (VisualGPTScore) | 18 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 59 | CACR base | 17.75 | No | - | - | - |
| 60 | VinVL | 17.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 61 | LLaVA-7B (GPTScore) | 17 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 62 | InstructBLIP-ZS-CoT | 16.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 63 | ALBEF 14M | 16.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 64 | BLIP (ITM) | 15.8 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 65 | METER | 15.75 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 66 | OFA tiny (TLC-A) | 15.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 67 | PEVL 14M | 15.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 68 | ALBEF 4M | 15.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 69 | ROSITA (Flickr30k) | 15.25 | No | - | - | - |
| 70 | BLIP 129M (CapFilt/L) | 15.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 71 | BLIP 129M | 15 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 72 | BLIP-ViT/L 129M | 14.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 73 | OFA large (ft SNLI-VE) | 14.3 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 74 | UNITER large | 14 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 75 | ViLT (ViT-B/32) | 14 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 76 | CLIP (SGVL) | 14 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 77 | PDM-CLIP (SelfEval) | 14 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 78 | CLIP RN50x64 | 13.75 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 79 | LDM-T5 (SelfEval) | 13.5 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 80 | FLAVA (contrastive) | 13.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 81 | ViLLA large | 13.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 82 | UNITER base | 13.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 83 | OCLIP (ViT-H/14) | 12.75 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 84 | TIFA | 12.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 85 | ViLLA base | 12 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 86 | PDM-T5 (SelfEval) | 12 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 87 | syn-CLIP | 11.5 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 88 | COCA ViT-L14 (f.t on COCO) | 11.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 89 | InstructBLIP | 11.5 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 90 | syn-CyCLIP | 10.75 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 91 | OFA base (ITM) | 10.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 92 | CLIP (ViT-B/32) | 10.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 93 | NegCLIP | 10.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 94 | OFA large (ITM) | 10.25 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 95 | CyCLIP | 9.5 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 96 | BLIP (ITC) | 9 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 97 | CLIP (ViT-L/14) | 8 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 98 | VSE++ (COCO, ResNet) | 8 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 99 | MiniGPT-4-7B (BERTScore) | 8 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 100 | OFA tiny (ITM) | 7.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 101 | ViLBERT base | 7.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 102 | LDM-CLIP (SelfEval) | 7.25 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 103 | LXMERT | 7 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 104 | VSRN (COCO) | 7 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 105 | VSE++ (Flickr30k, VGG) | 6.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 106 | UniT (ITM finetuned) | 6.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 107 | VSE++ (COCO, VGG) | 5.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 108 | LLaVA-7B (BERTScore) | 5.25 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 109 | VSRN (Flickr30k) | 5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 110 | VSE++ (Flickr30k, ResNet) | 5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 111 | VisualBERT base | 2.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |