Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, Roy Schwartz

2023-03-13 · ICCV 2023

Tasks: Question Answering, Common Sense Reasoning, Explanation Generation, Image Captioning, Image-to-Text Retrieval, Image Generation, Visual Question Answering (VQA), Visual Commonsense Reasoning

Abstract

Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly-available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models and code are available at the project website: whoops-benchmark.github.io

Results

Task                            | Dataset | Metric      | Value | Model
Visual Question Answering (VQA) | WHOOPS! | BEM         | 57    | BLIP2 FlanT5-XXL (Fine-tuned)
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 21    | BLIP2 FlanT5-XXL (Fine-tuned)
Visual Question Answering (VQA) | WHOOPS! | BEM         | 55    | BLIP2 FlanT5-XL (Fine-tuned)
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 20    | BLIP2 FlanT5-XL (Fine-tuned)
Visual Question Answering (VQA) | WHOOPS! | BEM         | 55    | BLIP2 FlanT5-XXL (Zero-shot)
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 15    | BLIP2 FlanT5-XXL (Zero-shot)
Visual Question Answering (VQA) | WHOOPS! | BEM         | 38    | OFA Large
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 8     | OFA Large
Visual Question Answering (VQA) | WHOOPS! | BEM         | 39    | BLIP Large
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 6     | BLIP Large
Visual Question Answering (VQA) | WHOOPS! | BEM         | 24    | BLIP2 FlanT5-XXL (Text-only FT)
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 4     | BLIP2 FlanT5-XXL (Text-only FT)
Image Captioning                | WHOOPS! | BLEU-4      | 42    | BLIP2 FlanT5-XXL (Fine-tuned)
Image Captioning                | WHOOPS! | CIDEr       | 177   | BLIP2 FlanT5-XXL (Fine-tuned)
Image Captioning                | WHOOPS! | BLEU-4      | 41    | BLIP2 FlanT5-XL (Fine-tuned)
Image Captioning                | WHOOPS! | CIDEr       | 174   | BLIP2 FlanT5-XL (Fine-tuned)
Image Captioning                | WHOOPS! | BLEU-4      | 31    | BLIP2 FlanT5-XXL (Zero-shot)
Image Captioning                | WHOOPS! | CIDEr       | 120   | BLIP2 FlanT5-XXL (Zero-shot)
Image Captioning                | WHOOPS! | BLEU-4      | 25    | CoCa ViT-L-14 MSCOCO
Image Captioning                | WHOOPS! | CIDEr       | 102   | CoCa ViT-L-14 MSCOCO
Image Captioning                | WHOOPS! | BLEU-4      | 13    | BLIP Large
Image Captioning                | WHOOPS! | CIDEr       | 65    | BLIP Large
Explanation Generation          | WHOOPS! | Human (%)   | 68    | Ground-truth Caption -> GPT3 (Oracle)
Explanation Generation          | WHOOPS! | Human (%)   | 33    | Predicted Caption -> GPT3
Explanation Generation          | WHOOPS! | Human (%)   | 27    | BLIP2 FlanT5-XXL (Fine-tuned)
Explanation Generation          | WHOOPS! | Human (%)   | 15    | BLIP2 FlanT5-XL (Fine-tuned)
Image-to-Text Retrieval         | WHOOPS! | Specificity | 94    | BLIP2 FlanT5-XXL (Text-only FT)
Image-to-Text Retrieval         | WHOOPS! | Specificity | 84    | BLIP2 FlanT5-XXL (Fine-tuned)
Image-to-Text Retrieval         | WHOOPS! | Specificity | 81    | BLIP2 FlanT5-XL (Fine-tuned)
Image-to-Text Retrieval         | WHOOPS! | Specificity | 77    | BLIP Large
Image-to-Text Retrieval         | WHOOPS! | Specificity | 72    | CoCa ViT-L-14 MSCOCO
Image-to-Text Retrieval         | WHOOPS! | Specificity | 71    | BLIP2 FlanT5-XXL (Zero-shot)
Image-to-Text Retrieval         | WHOOPS! | Specificity | 70    | CLIP ViT-L/14
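The VQA rows above report both BEM (a learned answer-equivalence score) and Exact Match. As a minimal sketch of how exact-match scoring typically works for open-ended VQA, the snippet below compares a normalized prediction against a set of reference answers; the `normalize` function and its normalization steps are an assumption here, not the exact procedure used by the WHOOPS! evaluation.

```python
def normalize(answer: str) -> str:
    # Lowercase, trim whitespace, and drop a trailing period.
    # These normalization choices are illustrative assumptions,
    # not the documented WHOOPS! evaluation procedure.
    return answer.lower().strip().rstrip(".")

def exact_match(prediction: str, references: list[str]) -> int:
    # Score 1 if the normalized prediction equals any normalized
    # reference answer, else 0; dataset-level Exact Match is the
    # mean of these per-question scores (as a percentage).
    pred = normalize(prediction)
    return int(any(pred == normalize(ref) for ref in references))

print(exact_match("Chess.", ["chess", "a game of chess"]))  # prints 1
print(exact_match("checkers", ["chess"]))                   # prints 0
```

Under this scheme a paraphrased but correct answer ("a game of chess" vs. "chess") scores 0, which is why softer metrics like BEM yield substantially higher numbers on the same predictions.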

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
- Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
- FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)