Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Visual Reasoning on Winoground

Metric: Text Score (higher is better)
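
For reference, the text score can be sketched as follows, assuming the definition from the original Winoground paper: each example pairs two captions with two images, and a model gets credit only when, for both images, it scores the matching caption above the swapped one. The `sim` callable, the toy identifiers, and the toy similarity values below are hypothetical stand-ins for a real model's caption-image scoring, not part of any listed submission.

```python
from typing import Callable, Sequence, Tuple

# One example: (caption_0, image_0, caption_1, image_1).
# Images are plain string identifiers here purely for illustration.
Example = Tuple[str, str, str, str]


def text_score(examples: Sequence[Example],
               sim: Callable[[str, str], float]) -> float:
    """Percentage of examples where the right caption wins for BOTH images."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for c0, i0, c1, i1 in examples:
        # Given image_0, caption_0 must outscore caption_1,
        # and given image_1, caption_1 must outscore caption_0.
        if sim(c0, i0) > sim(c1, i0) and sim(c1, i1) > sim(c0, i1):
            correct += 1
    return 100.0 * correct / len(examples)


# Hypothetical similarity table standing in for a model.
toy_scores = {
    ("a0", "x0"): 0.9, ("a1", "x0"): 0.2,
    ("a0", "x1"): 0.1, ("a1", "x1"): 0.8,  # both images resolved correctly
    ("b0", "y0"): 0.9, ("b1", "y0"): 0.2,
    ("b0", "y1"): 0.7, ("b1", "y1"): 0.6,  # second image resolved incorrectly
}
toy_examples = [("a0", "x0", "a1", "x1"), ("b0", "y0", "b1", "y1")]


def toy_sim(caption: str, image: str) -> float:
    return toy_scores[(caption, image)]


print(text_score(toy_examples, toy_sim))  # 50.0
```

Because both comparisons must succeed, a scorer that guesses at random is expected to land near 25, which is why that value serves as the chance baseline for this metric.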

Results

| # | Model | Text Score | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | GPT-4o + CA | 75.5 | No | A Cognitive Paradigm Approach to Probe the Perce... | 2025-01-23 | - |
| 2 | GPT-4V (CoT, pick b/w two options) | 75.25 | No | The Role of Chain-of-Thought in Complex Vision-L... | 2023-11-15 | - |
| 3 | GPT-4V (pick b/w two options) | 69.25 | No | The Role of Chain-of-Thought in Complex Vision-L... | 2023-11-15 | - |
| 4 | MMICL + CoCoT | 64.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 5 | GPT-4V + CoCoT | 58.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 6 | OpenFlamingo + CoCoT | 58.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 7 | GPT-4V | 54.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 8 | FIBER (EqSim) | 51.5 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 9 | FIBER (finetuned, Flickr30k) | 51.25 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 10 | MMICL + CCoT | 51 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 11 | OpenFlamingo + DDCoT | 47.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 12 | VQ2 | 47 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 13 | MMICL + DDCoT | 46.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 14 | X-VLM 16M | 46.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 15 | PaLI (ft SNLI-VE + Synthetic Data) | 46.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 16 | FIBER | 46.25 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 17 | MMICL (FLAN-T5-XXL) | 45.5 | No | MMICL: Empowering Vision-language Model with Mul... | 2023-09-14 | Code |
| 18 | PaLI (ft SNLI-VE) | 45 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 19 | Gemini + DDCoT | 45 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 20 | METER (EqSim) | 45 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 21 | X-VLM 4M | 44 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 22 | BLIP2 (ft COCO) | 44 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 23 | KeyComp* (GPT-4) | 43.5 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 24 | METER (finetuned, Flickr30k) | 43.5 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 25 | BLIP2 (SGVL) | 42.8 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 26 | BLIP (SGVL) | 42.8 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 27 | KeyComp* (GPT-3.5) | 42.7 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 28 | OpenFlamingo + CCoT | 42.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 29 | NegBLIP | 42.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 30 | IAIS large (Flickr30k) | 42.5 | No | - | - | - |
| 31 | LLaVA-1.5-CCoT | 42 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 32 | BLIP2 | 42 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 33 | IAIS large (COCO) | 41.75 | No | - | - | - |
| 34 | NegBLIP2 | 41.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 35 | BLIP (+Graph Text, +Graph Neg) | 40.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 36 | BLIP (+Graph Text) | 40.3 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 37 | Gemini + CoCoT | 40 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 38 | CACR base | 39.25 | No | - | - | - |
| 39 | METER | 39.25 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 40 | OpenFlamingo | 39 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 41 | BLIP | 39 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 42 | GPT-4V (image-caption match answer yes/no, zero-shot) | 38 | No | - | - | - |
| 43 | UNITER large | 38 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 44 | VinVL | 37.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 45 | ViLLA large | 37 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 46 | BLIP (VisualGPTScore, α-tuned) | 36.5 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 47 | BLIP 14M | 36.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 48 | ViT-B/16 + BERT base + ViLEM | 36.5 | No | - | - | - |
| 49 | LLaVA-1.5 | 36 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 50 | BLIP (ITM) | 35.8 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 51 | BLIP 129M | 35.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 52 | ROSITA (Flickr30k) | 35.25 | No | - | - | - |
| 53 | ViLT (ViT-B/32) | 34.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 54 | BLIP 129M (CapFilt/L) | 34.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 55 | BLIP-ViT/L 129M | 34.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 56 | Diffusion Classifier (zero-shot) | 34 | No | Your Diffusion Model is Secretly a Zero-Shot Cla... | 2023-03-28 | Code |
| 57 | PEVL 14M | 33.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 58 | ALBEF 14M | 32.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 59 | FLAVA (ITM) | 32.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 60 | UNITER base | 32.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 61 | CLIP (SGVL) | 32 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 62 | ViT-B/16 + BERT base | 31.2 | No | - | - | - |
| 63 | Gemini | 30.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 64 | OCLIP (ViT-H/14) | 30.75 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 65 | CLIP (ViT-B/32) | 30.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 66 | OFA large (ITM) | 30.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 67 | KeyComp (GPT-3.5) | 30.3 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 68 | CLIP (ViT-L/14) | 30.25 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 69 | ViLLA base | 30 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 70 | syn-CLIP | 30 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 71 | syn-CyCLIP | 30 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 72 | NegCLIP | 29.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 73 | OFA large (TLC-A) | 29.25 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 74 | ALBEF 4M | 29.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 75 | LDM-T5 (SelfEval) | 29 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 76 | CyCLIP | 28.5 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 77 | PDM-T5 (SelfEval) | 28.25 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 78 | COCA ViT-L14 (f.t on COCO) | 28.25 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 79 | LLaVA-1.5-ZS-CoT | 28 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 80 | BLIP (ITC) | 28 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 81 | OFA large (ft SNLI-VE) | 27.7 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 82 | OFA base (ITM) | 26.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 83 | CLIP RN50x64 | 26.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 84 | LLaVA-7B (GPTScore) | 25.5 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 85 | FLAVA (contrastive) | 25.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 86 | Random chance | 25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 87 | LLaVA | 24.8 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 88 | OFA base (TLC-A) | 24.5 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 89 | MiniGPT-4-7B (GPTScore) | 24.5 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 90 | ViLBERT base | 23.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 91 | MiniGPT-4 | 23.3 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 92 | MiniGPT-4-7B (VisualGPTScore) | 23.25 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 93 | VSE++ (COCO, ResNet) | 22.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 94 | OFA tiny (ITM) | 22.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 95 | LDM-CLIP (SelfEval) | 22.75 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 96 | Gemini + CCoT | 22.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 97 | InstructBLIP-CCoT | 21 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 98 | VSRN (Flickr30k) | 20 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 99 | VSE++ (Flickr30k, ResNet) | 20 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 100 | VSE++ (Flickr30k, VGG) | 19.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 101 | UniT (ITM finetuned) | 19.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 102 | LXMERT | 19.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 103 | TIFA | 19 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 104 | IDEFICS 80B | 18.75 | No | - | - | - |
| 105 | VSE++ (COCO, VGG) | 18.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 106 | VSRN (COCO) | 17.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 107 | PDM-CLIP (SelfEval) | 17 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 108 | IDEFICS 9B | 16.8 | No | - | - | - |
| 109 | OFA tiny (TLC-A) | 16.5 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 110 | VisualBERT base | 15.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 111 | MiniGPT-4-7B (BERTScore) | 14 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 112 | LLaVA-7B (BERTScore) | 13.5 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 113 | InstructBLIP-ZS-CoT | 9.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 114 | InstructBLIP | 7 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |