


IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

HAZ Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, Rifat Shahriyar

Published: 2024-03-23
Tasks: Common Sense Reasoning · Object Localization · Visual Question Answering (VQA) · Multiple-choice · Visual Question Answering
Links: Paper · PDF · Code (official)

Abstract

The advent of Vision Language Models (VLMs) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: how do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes designed to test the capability of VLMs in two distinct multiple-choice VQA tasks: comprehension and soft localization. GPT4V, the best-performing VLM, achieves 62.99% accuracy on the comprehension task (4-shot) and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy on comprehension and localization, respectively. We find that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.
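The evaluation protocol described above is a standard multiple-choice VQA loop: show the model an image plus lettered answer options and score the letter it picks. Below is a minimal zero-shot sketch of such a loop, not the paper's official harness. The Hugging Face dataset ID (`csebuetnlp/illusionvqa-comprehension`), the field names (`image`, `question`, `options`, `answer`), and the `ask_vlm` helper are assumptions for illustration; consult the official code repository for the actual loaders and prompt templates.

```python
# Minimal sketch of a zero-shot multiple-choice evaluation loop on IllusionVQA.
# ASSUMPTIONS (verify against the official repo): the dataset is hosted on the
# Hugging Face Hub under an ID like "csebuetnlp/illusionvqa-comprehension", and
# each example exposes `image`, `question`, `options` (list of strings), and
# `answer` (the correct option's text). `ask_vlm` is a placeholder for whichever
# VLM you query (GPT4-Vision, Gemini-Pro, LLaVA, ...).

from datasets import load_dataset

LETTERS = "ABCDEFGH"  # enough letters for the option count

def build_prompt(question: str, options: list[str]) -> str:
    """Format the question and lettered options as a single MCQ prompt."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the best option only.")
    return "\n".join(lines)

def ask_vlm(image, prompt: str) -> str:
    """Placeholder: send (image, prompt) to your VLM and return its raw reply."""
    raise NotImplementedError

def evaluate(dataset) -> float:
    """Score the fraction of examples where the model picks the gold letter."""
    correct = 0
    for ex in dataset:
        prompt = build_prompt(ex["question"], ex["options"])
        reply = ask_vlm(ex["image"], prompt).strip().upper()
        gold = LETTERS[ex["options"].index(ex["answer"])]
        if reply.startswith(gold):  # tolerate replies like "B." or "B) ..."
            correct += 1
    return correct / len(dataset)

if __name__ == "__main__":
    ds = load_dataset("csebuetnlp/illusionvqa-comprehension", split="test")
    print(f"Accuracy: {evaluate(ds):.2%}")
```

The paper's 4-shot setting would prepend four solved exemplars (image, question, options, answer) to each query, and the Chain-of-Thought variant would additionally ask the model to reason step by step before committing to a letter.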

Results

| Task                            | Dataset     | Metric       | Value | Model                  |
|---------------------------------|-------------|--------------|-------|------------------------|
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 62.99 | GPT4-Vision 4-shot     |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 58.85 | GPT4-Vision            |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 52.87 | Gemini-Pro 4-shot      |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 51.26 | Gemini-Pro             |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 40    | LLaVA-1.5-13B          |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 38.16 | CogVLM                 |
| Visual Question Answering (VQA) | IllusionVQA | Accuracy (%) | 34.25 | InstructBLIP-13B       |
| Object Localization             | IllusionVQA | Accuracy (%) | 49.7  | GPT4-Vision 4-shot+CoT |
| Object Localization             | IllusionVQA | Accuracy (%) | 46    | GPT4-Vision 4-shot     |
| Object Localization             | IllusionVQA | Accuracy (%) | 43.5  | Gemini-Pro             |
| Object Localization             | IllusionVQA | Accuracy (%) | 41.8  | Gemini-Pro 4-shot      |
| Object Localization             | IllusionVQA | Accuracy (%) | 40    | GPT4-Vision            |
| Object Localization             | IllusionVQA | Accuracy (%) | 33.9  | Gemini-Pro 4-shot+CoT  |
| Object Localization             | IllusionVQA | Accuracy (%) | 28    | CogVLM                 |
| Object Localization             | IllusionVQA | Accuracy (%) | 24.8  | LLaVA-1.5-13B          |
| Object Localization             | IllusionVQA | Accuracy (%) | 24.3  | InstructBLIP-13B       |

Related Papers

Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)