Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross

2022-04-07 | CVPR 2022 | Visual Reasoning

Abstract

We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
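The matching task in the abstract can be scored from four similarity values per example, one for each caption-image pairing. The sketch below assumes the text, image, and group criteria as they are usually stated for Winoground; this page does not define them, so treat the comparisons as an assumption. The key names (`c0_i0` and so on) and the toy scores are hypothetical and do not come from any model in the results.

```python
from typing import Dict, List

# Similarity scores for one example. The key names are hypothetical:
# "c0_i0" is the model's score for caption 0 paired with image 0, and so on.
Scores = Dict[str, float]


def text_correct(s: Scores) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s: Scores) -> bool:
    # For each caption, the matching image must outscore the other image.
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s: Scores) -> bool:
    # An example counts toward the group score only if both criteria hold.
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: List[Scores]) -> Dict[str, float]:
    # Percentage of examples satisfying each criterion.
    n = len(examples)
    return {
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
    }


# Toy scores, invented for illustration only.
toy_examples = [
    {"c0_i0": 0.9, "c0_i1": 0.2, "c1_i0": 0.3, "c1_i1": 0.8},  # text and image correct
    {"c0_i0": 0.6, "c0_i1": 0.7, "c1_i0": 0.5, "c1_i1": 0.8},  # text correct only
    {"c0_i0": 0.7, "c0_i1": 0.3, "c1_i0": 0.8, "c1_i1": 0.9},  # image correct only
    {"c0_i0": 0.4, "c0_i1": 0.1, "c1_i0": 0.5, "c1_i1": 0.3},  # neither correct
]
toy_result = winoground_scores(toy_examples)
print(toy_result)
```

Strict inequalities are used here, so a tie counts as a failure; that choice is also an assumption of this sketch.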

Results

Task: Visual Reasoning. Dataset: Winoground.

| Model | Text Score | Image Score | Group Score |
|---|---|---|---|
| UNITER large | 38 | 14 | 10.5 |
| VinVL | 37.75 | 17.75 | 14.5 |
| ViLLA large | 37 | 13.25 | 11 |
| ViLT (ViT-B/32) | 34.75 | 14 | 9.25 |
| FLAVA (ITM) | 32.25 | 20.5 | 14.25 |
| UNITER base | 32.25 | 13.25 | 10 |
| CLIP (ViT-B/32) | 30.75 | 10.5 | 8 |
| ViLLA base | 30 | 12 | 8 |
| FLAVA (contrastive) | 25.25 | 13.5 | 9 |
| Random chance | 25 | 25 | 16.67 |
| ViLBERT base | 23.75 | 7.25 | 4.75 |
| VSE++ (COCO, ResNet) | 22.75 | 8 | 4 |
| VSRN (Flickr30k) | 20 | 5 | 3.5 |
| VSE++ (Flickr30k, ResNet) | 20 | 5 | 2.75 |
| VSE++ (Flickr30k, VGG) | 19.75 | 6.25 | 4.5 |
| UniT (ITM finetuned) | 19.5 | 6.25 | 4 |
| LXMERT | 19.25 | 7 | 4 |
| VSE++ (COCO, VGG) | 18.75 | 5.5 | 3.5 |
| VSRN (COCO) | 17.5 | 7 | 3.75 |
| VisualBERT base | 15.5 | 2.5 | 1.5 |
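The "Random chance" entries (25, 25, and 16.67) can be reproduced by brute force. The sketch below assumes a model that assigns four distinct similarity scores in a uniformly random order, and it assumes the usual text, image, and group criteria, which this page does not spell out. The function name `chance_levels` and the key names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def chance_levels():
    # Enumerate all 24 orderings of four distinct scores and count how many
    # satisfy each criterion. Keys are hypothetical: "c0_i0" is the score for
    # caption 0 paired with image 0, and so on.
    keys = ("c0_i0", "c0_i1", "c1_i0", "c1_i1")
    text = image = group = total = 0
    for ranks in permutations(range(4)):
        s = dict(zip(keys, ranks))
        t = s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]
        i = s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]
        text += t
        image += i
        group += t and i
        total += 1
    return Fraction(text, total), Fraction(image, total), Fraction(group, total)


chance = chance_levels()
for name, value in zip(("text", "image", "group"), chance):
    print(name, value, round(100 * float(value), 2))
```

Under these assumptions the group criterion holds only when both matching pairs outrank both mismatched pairs, which is 4 of the 24 orderings, or 1/6.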

Related Papers

LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
Skywork-R1V3 Technical Report (2025-07-08)
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning (2025-07-08)
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning (2025-07-07)