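| Rank | Model | Score | Extra Training Data | Paper | Date | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | GPT-4V (CoT, pick b/w two options) | 68.75 | No | The Role of Chain-of-Thought in Complex Vision-L... | 2023-11-15 | - |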
| 2 | GPT-4o + CA | 58.5 | No | A Cognitive Paradigm Approach to Probe the Perce... | 2025-01-23 | - |
| 3 | OpenFlamingo + CoCoT | 55.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 4 | MMICL + CoCoT | 52.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 5 | GPT-4V + CoCoT | 49.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 6 | MMICL + CCoT | 48 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 7 | OpenFlamingo + DDCoT | 47.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 8 | GPT-4V (pick b/w two options) | 46.25 | No | The Role of Chain-of-Thought in Complex Vision-L... | 2023-11-15 | - |
| 9 | MMICL + DDCoT | 45 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 10 | MMICL (FLAN-T5-XXL) | 44.99 | No | MMICL: Empowering Vision-language Model with Mul... | 2023-09-14 | Code |
| 11 | GPT-4V | 42.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 12 | VQ2 | 42.2 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 13 | PaLI (ft SNLI-VE) | 41.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 14 | OpenFlamingo | 41.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 15 | PaLI (ft SNLI-VE + Synthetic Data) | 38 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 16 | GPT-4V (image-caption match answer yes/no, zero-shot) | 38 | No | - | - | - |
| 17 | LLaVA-1.5-CCoT | 35.5 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 18 | LLaVA-1.5 | 33.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 19 | Gemini + CCoT | 33 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 20 | Gemini + CoCoT | 32.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 21 | FIBER (EqSim) | 32 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 22 | KeyComp* (GPT-4) | 28.7 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 23 | BLIP2 (SGVL) | 28.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 24 | KeyComp* (GPT-3.5) | 27.8 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 25 | OpenFlamingo + CCoT | 27.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 26 | BLIP (SGVL) | 27.3 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 27 | OFA large (TLC-A) | 27 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 28 | X-VLM 4M | 26.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 29 | FIBER (finetuned, Flickr30k) | 26.5 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 30 | BLIP2 (ft COCO) | 26 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 31 | NegBLIP2 | 26 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 32 | Gemini | 26 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 33 | FIBER | 25.75 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 34 | BLIP (+Graph Text, +Graph Neg) | 25.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 35 | Gemini + DDCoT | 25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 36 | Random chance | 25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 37 | LLaVA | 25 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 38 | KeyComp (GPT-3.5) | 24.6 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 39 | X-VLM 16M | 24.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 40 | NegBLIP | 24 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 41 | BLIP2 | 23.8 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 42 | OFA base (TLC-A) | 23.5 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 43 | METER (EqSim) | 22.75 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 44 | LLaVA-1.5-ZS-CoT | 22.5 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 45 | IDEFICS 80B | 22.5 | No | - | - | - |
| 46 | MiniGPT-4-7B (GPTScore) | 21.75 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 47 | BLIP (VisualGPTScore, α-tuned) | 21.5 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 48 | InstructBLIP-CCoT | 21.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 49 | IDEFICS 9B | 20.8 | No | - | - | - |
| 50 | METER (finetuned, Flickr30k) | 20.75 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 51 | BLIP (+Graph Text) | 20.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 52 | FLAVA (ITM) | 20.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 53 | IAIS large (Flickr30k) | 19.75 | No | - | - | - |
| 54 | IAIS large (COCO) | 19.75 | No | - | - | - |
| 55 | BLIP | 19.2 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 56 | BLIP 14M | 18.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 57 | MiniGPT-4 | 18 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 58 | MiniGPT-4-7B (VisualGPTScore) | 18 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 59 | CACR base | 17.75 | No | - | - | - |
| 60 | VinVL | 17.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 61 | LLaVA-7B (GPTScore) | 17 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 62 | InstructBLIP-ZS-CoT | 16.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 63 | ALBEF 14M | 16.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 64 | BLIP (ITM) | 15.8 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 65 | METER | 15.75 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 66 | OFA tiny (TLC-A) | 15.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 67 | PEVL 14M | 15.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 68 | ALBEF 4M | 15.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 69 | ROSITA (Flickr30k) | 15.25 | No | - | - | - |
| 70 | BLIP 129M (CapFilt/L) | 15.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 71 | BLIP 129M | 15 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 72 | BLIP-ViT/L 129M | 14.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 73 | OFA large (ft SNLI-VE) | 14.3 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 74 | UNITER large | 14 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 75 | ViLT (ViT-B/32) | 14 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 76 | CLIP (SGVL) | 14 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 77 | PDM-CLIP (SelfEval) | 14 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 78 | CLIP RN50x64 | 13.75 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 79 | LDM-T5 (SelfEval) | 13.5 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 80 | FLAVA (contrastive) | 13.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 81 | ViLLA large | 13.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 82 | UNITER base | 13.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 83 | OCLIP (ViT-H/14) | 12.75 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 84 | TIFA | 12.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 85 | ViLLA base | 12 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 86 | PDM-T5 (SelfEval) | 12 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 87 | syn-CLIP | 11.5 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 88 | COCA ViT-L14 (f.t on COCO) | 11.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 89 | InstructBLIP | 11.5 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 90 | syn-CyCLIP | 10.75 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 91 | OFA base (ITM) | 10.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 92 | CLIP (ViT-B/32) | 10.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 93 | NegCLIP | 10.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 94 | OFA large (ITM) | 10.25 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 95 | CyCLIP | 9.5 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 96 | BLIP (ITC) | 9 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 97 | CLIP (ViT-L/14) | 8 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 98 | VSE++ (COCO, ResNet) | 8 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 99 | MiniGPT-4-7B (BERTScore) | 8 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 100 | OFA tiny (ITM) | 7.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 101 | ViLBERT base | 7.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 102 | LDM-CLIP (SelfEval) | 7.25 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 103 | LXMERT | 7 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 104 | VSRN (COCO) | 7 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 105 | VSE++ (Flickr30k, VGG) | 6.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 106 | UniT (ITM finetuned) | 6.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 107 | VSE++ (COCO, VGG) | 5.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 108 | LLaVA-7B (BERTScore) | 5.25 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 109 | VSRN (Flickr30k) | 5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 110 | VSE++ (Flickr30k, ResNet) | 5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 111 | VisualBERT base | 2.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |