Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross
We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly; crucially, both captions contain an identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
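The matching task can be made concrete with a short sketch of how the text, image, and group scores might be computed from a model's caption-image similarity scores. It assumes the usual reading of the Winoground metrics: the text score requires the correct caption to win for each image, the image score requires the correct image to win for each caption, and the group score requires both. The dictionary keys (`c0_i0` and so on), the function names, and the example numbers are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Each example is a 2x2 group of similarity scores, keyed "c<caption>_i<image>".
# Caption 0 belongs with image 0, and caption 1 belongs with image 1.
Scores = Dict[str, float]


def text_correct(s: Scores) -> bool:
    """For each image, the matching caption must score strictly higher."""
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s: Scores) -> bool:
    """For each caption, the matching image must score strictly higher."""
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s: Scores) -> bool:
    """Both the text and the image criteria must hold."""
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: List[Scores]) -> Dict[str, float]:
    """Return the percentage of examples passing each criterion."""
    n = len(examples)
    return {
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
    }


# Hypothetical similarity scores, not taken from any real model.
examples = [
    # Every comparison is right.
    {"c0_i0": 0.90, "c0_i1": 0.20, "c1_i0": 0.30, "c1_i1": 0.80},
    # Captions are picked correctly per image, but caption 0 prefers image 1.
    {"c0_i0": 0.90, "c0_i1": 0.95, "c1_i0": 0.30, "c1_i1": 0.97},
]
print(winoground_scores(examples))
```

Strict inequalities are used here, so a tie counts as a failure; that is a design choice of this sketch.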
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Reasoning | Winoground | Group Score | 10.5 | UNITER large |
| Visual Reasoning | Winoground | Image Score | 14 | UNITER large |
| Visual Reasoning | Winoground | Text Score | 38 | UNITER large |
| Visual Reasoning | Winoground | Group Score | 14.5 | VinVL |
| Visual Reasoning | Winoground | Image Score | 17.75 | VinVL |
| Visual Reasoning | Winoground | Text Score | 37.75 | VinVL |
| Visual Reasoning | Winoground | Group Score | 11 | ViLLA large |
| Visual Reasoning | Winoground | Image Score | 13.25 | ViLLA large |
| Visual Reasoning | Winoground | Text Score | 37 | ViLLA large |
| Visual Reasoning | Winoground | Group Score | 9.25 | ViLT (ViT-B/32) |
| Visual Reasoning | Winoground | Image Score | 14 | ViLT (ViT-B/32) |
| Visual Reasoning | Winoground | Text Score | 34.75 | ViLT (ViT-B/32) |
| Visual Reasoning | Winoground | Group Score | 14.25 | FLAVA (ITM) |
| Visual Reasoning | Winoground | Image Score | 20.5 | FLAVA (ITM) |
| Visual Reasoning | Winoground | Text Score | 32.25 | FLAVA (ITM) |
| Visual Reasoning | Winoground | Group Score | 10 | UNITER base |
| Visual Reasoning | Winoground | Image Score | 13.25 | UNITER base |
| Visual Reasoning | Winoground | Text Score | 32.25 | UNITER base |
| Visual Reasoning | Winoground | Group Score | 8 | CLIP (ViT-B/32) |
| Visual Reasoning | Winoground | Image Score | 10.5 | CLIP (ViT-B/32) |
| Visual Reasoning | Winoground | Text Score | 30.75 | CLIP (ViT-B/32) |
| Visual Reasoning | Winoground | Group Score | 8 | ViLLA base |
| Visual Reasoning | Winoground | Image Score | 12 | ViLLA base |
| Visual Reasoning | Winoground | Text Score | 30 | ViLLA base |
| Visual Reasoning | Winoground | Group Score | 9 | FLAVA (contrastive) |
| Visual Reasoning | Winoground | Image Score | 13.5 | FLAVA (contrastive) |
| Visual Reasoning | Winoground | Text Score | 25.25 | FLAVA (contrastive) |
| Visual Reasoning | Winoground | Group Score | 16.67 | Random chance |
| Visual Reasoning | Winoground | Image Score | 25 | Random chance |
| Visual Reasoning | Winoground | Text Score | 25 | Random chance |
| Visual Reasoning | Winoground | Group Score | 4.75 | ViLBERT base |
| Visual Reasoning | Winoground | Image Score | 7.25 | ViLBERT base |
| Visual Reasoning | Winoground | Text Score | 23.75 | ViLBERT base |
| Visual Reasoning | Winoground | Group Score | 4 | VSE++ (COCO, ResNet) |
| Visual Reasoning | Winoground | Image Score | 8 | VSE++ (COCO, ResNet) |
| Visual Reasoning | Winoground | Text Score | 22.75 | VSE++ (COCO, ResNet) |
| Visual Reasoning | Winoground | Group Score | 3.5 | VSRN (Flickr30k) |
| Visual Reasoning | Winoground | Image Score | 5 | VSRN (Flickr30k) |
| Visual Reasoning | Winoground | Text Score | 20 | VSRN (Flickr30k) |
| Visual Reasoning | Winoground | Group Score | 2.75 | VSE++ (Flickr30k, ResNet) |
| Visual Reasoning | Winoground | Image Score | 5 | VSE++ (Flickr30k, ResNet) |
| Visual Reasoning | Winoground | Text Score | 20 | VSE++ (Flickr30k, ResNet) |
| Visual Reasoning | Winoground | Group Score | 4.5 | VSE++ (Flickr30k, VGG) |
| Visual Reasoning | Winoground | Image Score | 6.25 | VSE++ (Flickr30k, VGG) |
| Visual Reasoning | Winoground | Text Score | 19.75 | VSE++ (Flickr30k, VGG) |
| Visual Reasoning | Winoground | Group Score | 4 | UniT (ITM finetuned) |
| Visual Reasoning | Winoground | Image Score | 6.25 | UniT (ITM finetuned) |
| Visual Reasoning | Winoground | Text Score | 19.5 | UniT (ITM finetuned) |
| Visual Reasoning | Winoground | Group Score | 4 | LXMERT |
| Visual Reasoning | Winoground | Image Score | 7 | LXMERT |
| Visual Reasoning | Winoground | Text Score | 19.25 | LXMERT |
| Visual Reasoning | Winoground | Group Score | 3.5 | VSE++ (COCO, VGG) |
| Visual Reasoning | Winoground | Image Score | 5.5 | VSE++ (COCO, VGG) |
| Visual Reasoning | Winoground | Text Score | 18.75 | VSE++ (COCO, VGG) |
| Visual Reasoning | Winoground | Group Score | 3.75 | VSRN (COCO) |
| Visual Reasoning | Winoground | Image Score | 7 | VSRN (COCO) |
| Visual Reasoning | Winoground | Text Score | 17.5 | VSRN (COCO) |
| Visual Reasoning | Winoground | Group Score | 1.5 | VisualBERT base |
| Visual Reasoning | Winoground | Image Score | 2.5 | VisualBERT base |
| Visual Reasoning | Winoground | Text Score | 15.5 | VisualBERT base |
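The random chance rows (25 for text and image, 16.67 for group) can be checked with a small enumeration. The sketch below assumes a model that assigns four distinct, exchangeable scores to the four caption-image pairs, so that all 24 rank orderings are equally likely, and it assumes the usual pairwise-comparison definitions of the three metrics. The function name `chance_rates` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def chance_rates():
    """Enumerate all rank orderings of the four caption-image scores.

    s00 is the rank of (caption 0, image 0), s01 of (caption 0, image 1),
    and so on. Ties are excluded by construction (assumption of this sketch).
    """
    text = image = group = 0
    orderings = list(permutations(range(4)))
    for s00, s01, s10, s11 in orderings:
        t = s00 > s10 and s11 > s01  # right caption wins for each image
        i = s00 > s01 and s11 > s10  # right image wins for each caption
        text += int(t)
        image += int(i)
        group += int(t and i)
    n = len(orderings)
    return Fraction(text, n), Fraction(image, n), Fraction(group, n)


text_rate, image_rate, group_rate = chance_rates()
print(text_rate, image_rate, group_rate)
print(round(100 * float(group_rate), 2))
```

Under these assumptions the group criterion holds only when both matching pairs outrank both mismatched pairs, which happens in 4 of the 24 orderings, that is 1/6.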