| 1 | GPT-4o + CA | 75.5 | No | A Cognitive Paradigm Approach to Probe the Perce... | 2025-01-23 | - |
| 2 | GPT-4V (CoT, pick between two options) | 75.25 | No | The Role of Chain-of-Thought in Complex Vision-L... | 2023-11-15 | - |
| 3 | GPT-4V (pick between two options) | 69.25 | No | The Role of Chain-of-Thought in Complex Vision-L... | 2023-11-15 | - |
| 4 | MMICL + CoCoT | 64.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 5 | GPT-4V + CoCoT | 58.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 6 | OpenFlamingo + CoCoT | 58.25 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 7 | GPT-4V | 54.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 8 | FIBER (EqSim) | 51.5 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 9 | FIBER (finetuned, Flickr30k) | 51.25 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 10 | MMICL + CCoT | 51 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 11 | OpenFlamingo + DDCoT | 47.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 12 | VQ2 | 47 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 13 | MMICL + DDCoT | 46.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 14 | X-VLM 16M | 46.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 15 | PaLI (ft SNLI-VE + Synthetic Data) | 46.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 16 | FIBER | 46.25 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 17 | MMICL (FLAN-T5-XXL) | 45.5 | No | MMICL: Empowering Vision-language Model with Mul... | 2023-09-14 | Code |
| 18 | PaLI (ft SNLI-VE) | 45 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 19 | Gemini + DDCoT | 45 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 20 | METER (EqSim) | 45 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 21 | X-VLM 4M | 44 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 22 | BLIP2 (ft COCO) | 44 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 23 | KeyComp* (GPT-4) | 43.5 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 24 | METER (finetuned, Flickr30k) | 43.5 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 25 | BLIP2 (SGVL) | 42.8 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 26 | BLIP (SGVL) | 42.8 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 27 | KeyComp* (GPT-3.5) | 42.7 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 28 | OpenFlamingo + CCoT | 42.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 29 | NegBLIP | 42.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 30 | IAIS large (Flickr30k) | 42.5 | No | - | - | - |
| 31 | LLaVA-1.5-CCoT | 42 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 32 | BLIP2 | 42 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 33 | IAIS large (COCO) | 41.75 | No | - | - | - |
| 34 | NegBLIP2 | 41.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 35 | BLIP (+Graph Text, +Graph Neg) | 40.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 36 | BLIP (+Graph Text) | 40.3 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 37 | Gemini + CoCoT | 40 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 38 | CACR base | 39.25 | No | - | - | - |
| 39 | METER | 39.25 | No | Equivariant Similarity for Vision-Language Found... | 2023-03-25 | Code |
| 40 | OpenFlamingo | 39 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 41 | BLIP | 39 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 42 | GPT-4V (image-caption match answer yes/no, zero-shot) | 38 | No | - | - | - |
| 43 | UNITER large | 38 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 44 | VinVL | 37.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 45 | ViLLA large | 37 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 46 | BLIP (VisualGPTScore, α-tuned) | 36.5 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 47 | BLIP 14M | 36.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 48 | ViT-B/16 + BERT base + ViLEM | 36.5 | No | - | - | - |
| 49 | LLaVA-1.5 | 36 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 50 | BLIP (ITM) | 35.8 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 51 | BLIP 129M | 35.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 52 | ROSITA (Flickr30k) | 35.25 | No | - | - | - |
| 53 | ViLT (ViT-B/32) | 34.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 54 | BLIP 129M (CapFilt/L) | 34.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 55 | BLIP-ViT/L 129M | 34.7 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 56 | Diffusion Classifier (zero-shot) | 34 | No | Your Diffusion Model is Secretly a Zero-Shot Cla... | 2023-03-28 | Code |
| 57 | PEVL 14M | 33.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 58 | ALBEF 14M | 32.5 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 59 | FLAVA (ITM) | 32.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 60 | UNITER base | 32.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 61 | CLIP (SGVL) | 32 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 62 | ViT-B/16 + BERT base | 31.2 | No | - | - | - |
| 63 | Gemini | 30.75 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 64 | OCLIP (ViT-H/14) | 30.75 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 65 | CLIP (ViT-B/32) | 30.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 66 | OFA large (ITM) | 30.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 67 | KeyComp (GPT-3.5) | 30.3 | No | Prompting Large Vision-Language Models for Compo... | 2024-01-20 | Code |
| 68 | CLIP (ViT-L/14) | 30.25 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 69 | ViLLA base | 30 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 70 | syn-CLIP | 30 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 71 | syn-CyCLIP | 30 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 72 | NegCLIP | 29.5 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 73 | OFA large (TLC-A) | 29.25 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 74 | ALBEF 4M | 29.2 | No | Measuring Progress in Fine-grained Vision-and-La... | 2023-05-12 | Code |
| 75 | LDM-T5 (SelfEval) | 29 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 76 | CyCLIP | 28.5 | No | Going Beyond Nouns With Vision & Language Models... | 2023-03-30 | Code |
| 77 | PDM-T5 (SelfEval) | 28.25 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 78 | COCA ViT-L14 (ft COCO) | 28.25 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 79 | LLaVA-1.5-ZS-CoT | 28 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 80 | BLIP (ITC) | 28 | No | Revisiting the Role of Language Priors in Vision... | 2023-06-02 | Code |
| 81 | OFA large (ft SNLI-VE) | 27.7 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 82 | OFA base (ITM) | 26.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 83 | CLIP RN50x64 | 26.5 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 84 | LLaVA-7B (GPTScore) | 25.5 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 85 | FLAVA (contrastive) | 25.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 86 | Random chance | 25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 87 | LLaVA | 24.8 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 88 | OFA base (TLC-A) | 24.5 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 89 | MiniGPT-4-7B (GPTScore) | 24.5 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 90 | ViLBERT base | 23.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 91 | MiniGPT-4 | 23.3 | No | Incorporating Structured Representations into Pr... | 2023-05-10 | - |
| 92 | MiniGPT-4-7B (VisualGPTScore) | 23.25 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 93 | VSE++ (COCO, ResNet) | 22.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 94 | OFA tiny (ITM) | 22.75 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 95 | LDM-CLIP (SelfEval) | 22.75 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 96 | Gemini + CCoT | 22.5 | No | CoCoT: Contrastive Chain-of-Thought Prompting fo... | 2024-01-05 | Code |
| 97 | InstructBLIP-CCoT | 21 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 98 | VSRN (Flickr30k) | 20 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 99 | VSE++ (Flickr30k, ResNet) | 20 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 100 | VSE++ (Flickr30k, VGG) | 19.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 101 | UniT (ITM finetuned) | 19.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 102 | LXMERT | 19.25 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 103 | TIFA | 19 | No | What You See is What You Read? Improving Text-Im... | 2023-05-17 | Code |
| 104 | IDEFICS 80B | 18.75 | No | - | - | - |
| 105 | VSE++ (COCO, VGG) | 18.75 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 106 | VSRN (COCO) | 17.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 107 | PDM-CLIP (SelfEval) | 17 | No | SelfEval: Leveraging the discriminative nature o... | 2023-11-17 | - |
| 108 | IDEFICS 9B | 16.8 | No | - | - | - |
| 109 | OFA tiny (TLC-A) | 16.5 | No | Simple Token-Level Confidence Improves Caption C... | 2023-05-11 | - |
| 110 | VisualBERT base | 15.5 | No | Winoground: Probing Vision and Language Models f... | 2022-04-07 | Code |
| 111 | MiniGPT-4-7B (BERTScore) | 14 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 112 | LLaVA-7B (BERTScore) | 13.5 | No | An Examination of the Compositionality of Large ... | 2023-08-21 | Code |
| 113 | InstructBLIP-ZS-CoT | 9.3 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |
| 114 | InstructBLIP | 7 | No | Compositional Chain-of-Thought Prompting for Lar... | 2023-11-27 | Code |