


IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

HAZ Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, Rifat Shahriyar

Published: 2024-03-23
Tasks: Common Sense Reasoning · Object Localization · Visual Question Answering (VQA) · Multiple-choice · Visual Question Answering
Links: Paper · PDF · Code (official)

Abstract

The advent of Vision Language Models (VLMs) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: how do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes designed to test the capability of VLMs in two distinct multiple-choice VQA tasks: comprehension and soft localization. GPT4V, the best-performing VLM, achieves 62.99% accuracy on the comprehension task (4-shot) and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy on comprehension and localization, respectively. We find that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.
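The evaluation protocol described above is a standard multiple-choice VQA loop: show the model an image plus lettered answer options and score the letter it picks. Below is a minimal zero-shot sketch of such a loop, not the paper's official harness. The Hugging Face dataset ID (`csebuetnlp/illusionvqa-comprehension`), the field names (`image`, `question`, `options`, `answer`), and the `ask_vlm` helper are assumptions for illustration; consult the official code repository for the actual loaders and prompt templates.

```python
# Minimal sketch of a zero-shot multiple-choice evaluation loop on IllusionVQA.
# ASSUMPTIONS (verify against the official repo): the dataset is hosted on the
# Hugging Face Hub under an ID like "csebuetnlp/illusionvqa-comprehension", and
# each example exposes `image`, `question`, `options` (list of strings), and
# `answer` (the correct option's text). `ask_vlm` is a placeholder for whichever
# VLM you query (GPT4-Vision, Gemini-Pro, LLaVA, ...).

from datasets import load_dataset

LETTERS = "ABCDEFGH"  # enough letters for the option count

def build_prompt(question: str, options: list[str]) -> str:
    """Format the question and lettered options as a single MCQ prompt."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the best option only.")
    return "\n".join(lines)

def ask_vlm(image, prompt: str) -> str:
    """Placeholder: send (image, prompt) to your VLM and return its raw reply."""
    raise NotImplementedError

def evaluate(dataset) -> float:
    """Score the fraction of examples where the model picks the gold letter."""
    correct = 0
    for ex in dataset:
        prompt = build_prompt(ex["question"], ex["options"])
        reply = ask_vlm(ex["image"], prompt).strip().upper()
        gold = LETTERS[ex["options"].index(ex["answer"])]
        if reply.startswith(gold):  # tolerate replies like "B." or "B) ..."
            correct += 1
    return correct / len(dataset)

if __name__ == "__main__":
    ds = load_dataset("csebuetnlp/illusionvqa-comprehension", split="test")
    print(f"Accuracy: {evaluate(ds):.2%}")
```

The paper's 4-shot setting would prepend four solved exemplars (image, question, options, answer) to each query, and the Chain-of-Thought variant would additionally ask the model to reason step by step before committing to a letter.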

Results

| Task                            | Dataset     | Metric       | Value | Model                  |
|---------------------------------|-------------|--------------|-------|------------------------|
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 62.99 | GPT4-Vision 4-shot     |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 58.85 | GPT4-Vision            |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 52.87 | Gemini-Pro 4-shot      |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 51.26 | Gemini-Pro             |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 40    | LLaVA-1.5-13B          |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 38.16 | CogVLM                 |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 34.25 | InstructBLIP-13B       |
| Object Localization             | IllusionVQA | Accuracy (%) | 49.7  | GPT4-Vision 4-shot+CoT |
| Object Localization             | IllusionVQA | Accuracy (%) | 46    | GPT4-Vision 4-shot     |
| Object Localization             | IllusionVQA | Accuracy (%) | 43.5  | Gemini-Pro             |
| Object Localization             | IllusionVQA | Accuracy (%) | 41.8  | Gemini-Pro 4-shot      |
| Object Localization             | IllusionVQA | Accuracy (%) | 40    | GPT4-Vision            |
| Object Localization             | IllusionVQA | Accuracy (%) | 33.9  | Gemini-Pro 4-shot+CoT  |
| Object Localization             | IllusionVQA | Accuracy (%) | 28    | CogVLM                 |
| Object Localization             | IllusionVQA | Accuracy (%) | 24.8  | LLaVA-1.5-13B          |
| Object Localization             | IllusionVQA | Accuracy (%) | 24.3  | InstructBLIP-13B       |

Related Papers

Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)