Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross

2022-04-07 | CVPR 2022 | Visual Reasoning

Abstract

We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
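The constraint that both captions contain an identical set of words in a different order can be checked mechanically. The following is a minimal sketch, not the authors' tooling: the function name `same_words_different_order` is hypothetical, tokenization is assumed to be lowercase whitespace splitting, and the caption pair in the usage lines is illustrative.

```python
from collections import Counter


def same_words_different_order(caption_a: str, caption_b: str) -> bool:
    """Return True if both captions use exactly the same words
    (counted with multiplicity) but arrange them differently.

    Assumption: words are obtained by lowercasing and splitting on
    whitespace; a real pipeline may need to handle punctuation.
    """
    words_a = caption_a.lower().split()
    words_b = caption_b.lower().split()
    # Same multiset of words, but not the same sequence.
    return Counter(words_a) == Counter(words_b) and words_a != words_b


# Illustrative caption pair: identical words, swapped roles.
pair_ok = same_words_different_order(
    "some plants surrounding a lightbulb",
    "a lightbulb surrounding some plants",
)
# A pair that differs in vocabulary does not satisfy the constraint.
pair_bad = same_words_different_order("a dog on a sofa", "a cat on a sofa")
```

Because the vocabulary is identical by construction, a model that treats a caption as an unordered bag of words cannot tell the two captions apart; only word order carries the distinction.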

Results

Task: Visual Reasoning. Dataset: Winoground. All values are percentages.

Model	Text Score	Image Score	Group Score
UNITER large	38	14	10.5
VinVL	37.75	17.75	14.5
ViLLA large	37	13.25	11
ViLT (ViT-B/32)	34.75	14	9.25
FLAVA (ITM)	32.25	20.5	14.25
UNITER base	32.25	13.25	10
CLIP (ViT-B/32)	30.75	10.5	8
ViLLA base	30	12	8
FLAVA (contrastive)	25.25	13.5	9
Random chance	25	25	16.67
ViLBERT base	23.75	7.25	4.75
VSE++ (COCO, ResNet)	22.75	8	4
VSRN (Flickr30k)	20	5	3.5
VSE++ (Flickr30k, ResNet)	20	5	2.75
VSE++ (Flickr30k, VGG)	19.75	6.25	4.5
UniT (ITM finetuned)	19.5	6.25	4
LXMERT	19.25	7	4
VSE++ (COCO, VGG)	18.75	5.5	3.5
VSRN (COCO)	17.5	7	3.75
VisualBERT base	15.5	2.5	1.5
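The three metrics in the table can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' evaluation code. It assumes each example is a 2x2 grid `s` where `s[c][i]` is the model's match score for caption `c` with image `i`, and that caption 0 belongs to image 0 and caption 1 to image 1. The function names are hypothetical. The text score asks whether, for each image, the correct caption scores higher; the image score asks whether, for each caption, the correct image scores higher; the group score requires both. The last function reproduces the "Random chance" row by enumerating all 24 strict rankings of the four scores, assuming a random scorer makes each ranking equally likely and never produces ties.

```python
from itertools import permutations


def text_correct(s):
    """For each image, the matching caption must score higher."""
    return s[0][0] > s[1][0] and s[1][1] > s[0][1]


def image_correct(s):
    """For each caption, the matching image must score higher."""
    return s[0][0] > s[0][1] and s[1][1] > s[1][0]


def group_correct(s):
    """Both the text and the image criteria must hold."""
    return text_correct(s) and image_correct(s)


def winoground_scores(examples):
    """Return text, image, and group scores as percentages.

    `examples` is a list of 2x2 grids with s[c][i] = score of
    caption c paired with image i (an assumed input layout).
    """
    n = len(examples)
    return {
        "text": 100 * sum(text_correct(s) for s in examples) / n,
        "image": 100 * sum(image_correct(s) for s in examples) / n,
        "group": 100 * sum(group_correct(s) for s in examples) / n,
    }


def chance_levels():
    """Chance level under the no-ties, uniform-ranking assumption."""
    grids = [[[p[0], p[1]], [p[2], p[3]]] for p in permutations(range(4))]
    return winoground_scores(grids)


chance = chance_levels()

# A hypothetical model output that passes the text check only:
# caption 1 is scored higher with image 0 than with its own image 1.
example = [[0.9, 0.2], [0.6, 0.5]]
example_scores = winoground_scores([example])
```

Under these assumptions the enumeration gives 6 of 24 rankings for the text criterion, 6 of 24 for the image criterion, and 4 of 24 for the group criterion, which matches the 25, 25, and 16.67 reported for random chance. It also shows why the group score is the hardest: both correct pairings must outrank both incorrect ones.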

