Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, Roy Schwartz

2023-03-13 · ICCV 2023

Tasks: Question Answering, Common Sense Reasoning, Explanation Generation, Image Captioning, Image-to-Text Retrieval, Image Generation, Visual Question Answering (VQA), Visual Commonsense Reasoning

Abstract

Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is comprised of purposefully commonsense-defying images created by designers using publicly-available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. Data, models and code are available at the project website: whoops-benchmark.github.io

Results

Task                            | Dataset | Metric      | Value | Model
Visual Question Answering (VQA) | WHOOPS! | BEM         | 57    | BLIP2 FlanT5-XXL (Fine-tuned)
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 21    | BLIP2 FlanT5-XXL (Fine-tuned)
Visual Question Answering (VQA) | WHOOPS! | BEM         | 55    | BLIP2 FlanT5-XL (Fine-tuned)
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 20    | BLIP2 FlanT5-XL (Fine-tuned)
Visual Question Answering (VQA) | WHOOPS! | BEM         | 55    | BLIP2 FlanT5-XXL (Zero-shot)
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 15    | BLIP2 FlanT5-XXL (Zero-shot)
Visual Question Answering (VQA) | WHOOPS! | BEM         | 38    | OFA Large
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 8     | OFA Large
Visual Question Answering (VQA) | WHOOPS! | BEM         | 39    | BLIP Large
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 6     | BLIP Large
Visual Question Answering (VQA) | WHOOPS! | BEM         | 24    | BLIP2 FlanT5-XXL (Text-only FT)
Visual Question Answering (VQA) | WHOOPS! | Exact Match | 4     | BLIP2 FlanT5-XXL (Text-only FT)
Image Captioning                | WHOOPS! | BLEU-4      | 42    | BLIP2 FlanT5-XXL (Fine-tuned)
Image Captioning                | WHOOPS! | CIDEr       | 177   | BLIP2 FlanT5-XXL (Fine-tuned)
Image Captioning                | WHOOPS! | BLEU-4      | 41    | BLIP2 FlanT5-XL (Fine-tuned)
Image Captioning                | WHOOPS! | CIDEr       | 174   | BLIP2 FlanT5-XL (Fine-tuned)
Image Captioning                | WHOOPS! | BLEU-4      | 31    | BLIP2 FlanT5-XXL (Zero-shot)
Image Captioning                | WHOOPS! | CIDEr       | 120   | BLIP2 FlanT5-XXL (Zero-shot)
Image Captioning                | WHOOPS! | BLEU-4      | 25    | CoCa ViT-L-14 MSCOCO
Image Captioning                | WHOOPS! | CIDEr       | 102   | CoCa ViT-L-14 MSCOCO
Image Captioning                | WHOOPS! | BLEU-4      | 13    | BLIP Large
Image Captioning                | WHOOPS! | CIDEr       | 65    | BLIP Large
Explanation Generation          | WHOOPS! | Human (%)   | 68    | Ground-truth Caption -> GPT3 (Oracle)
Explanation Generation          | WHOOPS! | Human (%)   | 33    | Predicted Caption -> GPT3
Explanation Generation          | WHOOPS! | Human (%)   | 27    | BLIP2 FlanT5-XXL (Fine-tuned)
Explanation Generation          | WHOOPS! | Human (%)   | 15    | BLIP2 FlanT5-XL (Fine-tuned)
Image-to-Text Retrieval         | WHOOPS! | Specificity | 94    | BLIP2 FlanT5-XXL (Text-only FT)
Image-to-Text Retrieval         | WHOOPS! | Specificity | 84    | BLIP2 FlanT5-XXL (Fine-tuned)
Image-to-Text Retrieval         | WHOOPS! | Specificity | 81    | BLIP2 FlanT5-XL (Fine-tuned)
Image-to-Text Retrieval         | WHOOPS! | Specificity | 77    | BLIP Large
Image-to-Text Retrieval         | WHOOPS! | Specificity | 72    | CoCa ViT-L-14 MSCOCO
Image-to-Text Retrieval         | WHOOPS! | Specificity | 71    | BLIP2 FlanT5-XXL (Zero-shot)
Image-to-Text Retrieval         | WHOOPS! | Specificity | 70    | CLIP ViT-L/14
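The VQA rows above report both BEM (a learned answer-equivalence score) and Exact Match. As a minimal sketch of how exact-match scoring typically works for open-ended VQA, the snippet below compares a normalized prediction against a set of reference answers; the `normalize` function and its normalization steps are an assumption here, not the exact procedure used by the WHOOPS! evaluation.

```python
def normalize(answer: str) -> str:
    # Lowercase, trim whitespace, and drop a trailing period.
    # These normalization choices are illustrative assumptions,
    # not the documented WHOOPS! evaluation procedure.
    return answer.lower().strip().rstrip(".")

def exact_match(prediction: str, references: list[str]) -> int:
    # Score 1 if the normalized prediction equals any normalized
    # reference answer, else 0; dataset-level Exact Match is the
    # mean of these per-question scores (as a percentage).
    pred = normalize(prediction)
    return int(any(pred == normalize(ref) for ref in references))

print(exact_match("Chess.", ["chess", "a game of chess"]))  # prints 1
print(exact_match("checkers", ["chess"]))                   # prints 0
```

Under this scheme a paraphrased but correct answer ("a game of chess" vs. "chess") scores 0, which is why softer metrics like BEM yield substantially higher numbers on the same predictions.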

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
- Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
- FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)