Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

Yonatan Bitton, Nitzan Bitton Guetta, Ron Yosef, Yuval Elovici, Mohit Bansal, Gabriel Stanovsky, Roy Schwartz

2022-07-25 · Question Answering · Common Sense Reasoning · Visual Reasoning · Visual Question Answering (VQA) · Multimodal Association · General Knowledge
Paper · PDF · Code (official)

Abstract

While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game of vision-and-language associations (e.g., between werewolves and a full moon), used as a dynamic evaluation benchmark. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player tries to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code and the interactive game, allowing future data collection that can be used to develop models with better association abilities.
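The benchmark scores each instance with the Jaccard index between the set of candidates a model selects and the set humans agreed on. A minimal sketch of that per-instance score is below; the helper name and the example cue/candidates are illustrative, not taken from the paper's released code.

```python
def jaccard_index(predicted, gold):
    """Jaccard index: |intersection| / |union| of the two selection sets."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # both empty: treat as perfect agreement
    return len(predicted & gold) / len(predicted | gold)

# Hypothetical instance: cue "nocturnal" over several image candidates.
gold = {"werewolf", "full_moon", "owl"}   # human-validated associations
pred = {"werewolf", "owl", "bat"}         # model's selections
score = jaccard_index(pred, gold)         # 2 shared / 4 in union = 0.5
```

Benchmark-level numbers like the 52% reported for ViLT would then be the average of this score over all collected instances.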

Results

Task                   | Dataset   | Metric        | Value | Model
Visual Reasoning       | WinoGAViL | Jaccard Index | 90    | Humans
Visual Reasoning       | WinoGAViL | Jaccard Index | 52    | ViLT (Zero-Shot)
Visual Reasoning       | WinoGAViL | Jaccard Index | 46    | X-VLM (Zero-Shot)
Visual Reasoning       | WinoGAViL | Jaccard Index | 41    | CLIP-ViT-B/32 (Zero-Shot)
Visual Reasoning       | WinoGAViL | Jaccard Index | 40    | CLIP-ViT-L/14 (Zero-Shot)
Visual Reasoning       | WinoGAViL | Jaccard Index | 38    | CLIP-RN50x64/14 (Zero-Shot)
Visual Reasoning       | WinoGAViL | Jaccard Index | 35    | CLIP-RN50 (Zero-Shot)
Visual Reasoning       | WinoGAViL | Jaccard Index | 15    | CLIP-ViL (Zero-Shot)
Common Sense Reasoning | WinoGAViL | Jaccard Index | 52    | ViLT

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)