| 1 | GPT-4V (CoT, pick between two options) | 58.75 | No | The Role of Chain-of-Thought in Complex Vision-L... | 2023-11-15 | - |
| 2 | GPT-4o + CA | 52 | No | A Cognitive Paradigm Approach to Probe the Perce... | 2025-01-23 | - |
| 3 | MMICL + CoCoT | 50.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 4 | MMICL + CCoT | 47.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 5 | GPT-4V + CoCoT | 44.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 6 | MMICL (FLAN-T5-XXL) | 43 | No | MMICL: Empowering Vision-language Model with Mul... | 2023-09-14 | Code |
| 7 | OpenFlamingo + CoCoT | 41.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 8 | GPT-4V (pick between two options) | 39.25 | No | The Role of Chain-of-Thought in Complex Vision-L... | 2023-11-15 | - |
| 9 | OpenFlamingo + DDCoT | 39 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 10 | GPT-4V (image-caption match answer yes/no, zero-shot) | 38 | No | - | - | - |
| 11 | GPT-4V | 37.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 12 | MMICL + DDCoT | 36.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 13 | OpenFlamingo | 33.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 14 | VQ2 | 30.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 15 | PaLI (ft SNLI-VE + Synthetic Data) | 28.75 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 16 | PaLI (ft SNLI-VE) | 28.7 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 17 | Gemini + CoCoT | 27.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 18 | FIBER (EqSim) | 27.5 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 19 | Gemini | 25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 20 | Gemini + DDCoT | 23.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 21 | BLIP2 (ft COCO) | 23.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 22 | BLIP2 (SGVL) | 23.3 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 23 | FIBER (finetuned, Flickr30k) | 23 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 24 | LLaVA-1.5-CCoT | 22.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 25 | FIBER | 22.25 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 26 | X-VLM 4M | 21.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 27 | BLIP (SGVL) | 21.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 28 | X-VLM 16M | 21.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 29 | Gemini + CCoT | 20.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 30 | NegBLIP2 | 20.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 31 | LLaVA-1.5 | 20.1 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 32 | OpenFlamingo + CCoT | 20 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 33 | BLIP2 | 19 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 34 | BLIP (+Graph Text, +Graph Neg) | 19 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 35 | METER (EqSim) | 18.75 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 36 | NegBLIP | 18.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 37 | KeyComp* (GPT-4) | 18.2 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 38 | OFA large (TLC-A) | 17.5 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 39 | KeyComp* (GPT-3.5) | 17.4 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 40 | BLIP (VisualGPTScore, α-tuned) | 16.8 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 41 | Random chance | 16.67 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 42 | BLIP (+Graph Text) | 16.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 43 | IAIS large (Flickr30k) | 16 | No | - | - | - |
| 44 | IAIS large (COCO) | 15.5 | No | - | - | - |
| 45 | BLIP | 15 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 46 | METER (finetuned, Flickr30k) | 14.75 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 47 | VinVL | 14.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 48 | BLIP 14M | 14.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 49 | CACR base | 14.25 | No | - | - | - |
| 50 | FLAVA (ITM) | 14.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 51 | OFA base (TLC-A) | 13.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 52 | BLIP (ITM) | 13.3 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 53 | LLaVA | 13 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 54 | ALBEF 14M | 12.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 55 | KeyComp (GPT-3.5) | 12.4 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 56 | LLaVA-1.5-ZS-CoT | 12.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 57 | ROSITA (Flickr30k) | 12.25 | No | - | - | - |
| 58 | BLIP 129M (CapFilt/L) | 12.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 59 | BLIP-ViT/L 129M | 12.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 60 | PEVL 14M | 12.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 61 | METER | 12 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 62 | BLIP 129M | 11.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 63 | MiniGPT-4-7B (GPTScore) | 11.5 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 64 | TIFA | 11.3 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 65 | ViLLA large | 11 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 66 | ALBEF 4M | 11 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 67 | UNITER large | 10.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 68 | LLaVA-7B (GPTScore) | 10.5 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 69 | CLIP RN50x64 | 10.25 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 70 | UNITER base | 10 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 71 | CLIP (SGVL) | 9.8 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 72 | syn-CLIP | 9.5 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 73 | MiniGPT-4 | 9.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 74 | MiniGPT-4-7B (VisualGPTScore) | 9.5 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 75 | ViLT (ViT-B/32) | 9.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 76 | OFA large (ft SNLI-VE) | 9 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 77 | FLAVA (contrastive) | 9 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 78 | InstructBLIP-CCoT | 8.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 79 | syn-CyCLIP | 8.25 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 80 | CoCa ViT-L14 (ft COCO) | 8.25 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 81 | CLIP (ViT-B/32) | 8 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 82 | ViLLA base | 8 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 83 | NegCLIP | 8 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 84 | IDEFICS 80B | 8 | No | - | - | - |
| 85 | OFA large (ITM) | 7.25 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 86 | CyCLIP | 7.25 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 87 | OFA tiny (TLC-A) | 6.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 88 | BLIP (ITC) | 6.5 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 89 | OFA base (ITM) | 6.5 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 90 | IDEFICS 9B | 5 | No | - | - | - |
| 91 | ViLBERT base | 4.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 92 | OFA tiny (ITM) | 4.5 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 93 | VSE++ (Flickr30k, VGG) | 4.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 94 | VSE++ (COCO, ResNet) | 4 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 95 | UniT (ITM finetuned) | 4 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 96 | LXMERT | 4 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 97 | InstructBLIP-ZS-CoT | 4 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 98 | VSRN (COCO) | 3.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 99 | VSRN (Flickr30k) | 3.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 100 | VSE++ (COCO, VGG) | 3.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 101 | InstructBLIP | 3.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 102 | VSE++ (Flickr30k, ResNet) | 2.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 103 | MiniGPT-4-7B (BERTScore) | 2.75 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 104 | LLaVA-7B (BERTScore) | 2.25 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 105 | VisualBERT base | 1.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |