Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs

Mohit Vaishnav, Tanel Tammet

2025-01-23 · Descriptive · Few-Shot Image Classification · Visual Reasoning · Diagnostic
Paper · PDF

Abstract

A fundamental challenge in artificial intelligence involves understanding the cognitive mechanisms underlying visual reasoning in sophisticated models like Vision-Language Models (VLMs). How do these models integrate visual perception with abstract thought, especially when reasoning across multiple images or requiring fine-grained compositional understanding? Drawing inspiration from cognitive science, this paper introduces a structured evaluation framework using diverse visual reasoning tasks, Bongard Problems (BPs) and Winoground, to dissect the perception-reasoning interface in VLMs. We propose three distinct evaluation paradigms, mirroring human problem-solving strategies: Direct Visual Rule Learning (DVRL; holistic processing), Deductive Rule Learning (DRL; rule extraction and application), and Componential Analysis (CA; analytical decomposition via task-agnostic textual descriptions). These paradigms systematically vary cognitive load and probe processing stages. Notably, CA enables multi-image reasoning evaluation even for single-image architectures and isolates reasoning from perception by operating on textual descriptions. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance on challenging benchmarks including Bongard-OpenWorld, Bongard-HOI, and Winoground. Ablation studies confirm reasoning improves significantly when perceptual challenges are mitigated, revealing a critical perception bottleneck. Our framework provides a valuable diagnostic tool and suggests that decoupling perception (via rich, task-agnostic description) from reasoning is a promising direction for robust and general visual intelligence.
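The Componential Analysis (CA) paradigm described above can be sketched as a two-stage pipeline: first convert each image into an independent, task-agnostic textual description (a VLM call in practice), then hand only the text to a reasoning language model. Below is a minimal illustration of that structure; `describe` and `reason` are hypothetical callables standing in for real model APIs, and the prompt wording is an assumption, not the paper's actual prompt.

```python
def componential_analysis(describe, reason, positives, negatives, query):
    """Classify `query` against a Bongard-style concept using text only.

    describe: callable(image) -> str   (stands in for a VLM captioner)
    reason:   callable(prompt) -> str  (stands in for a reasoning LLM)
    """
    # Stage 1: perception -- each image is described independently,
    # with no knowledge of the task.
    pos = [describe(img) for img in positives]
    neg = [describe(img) for img in negatives]
    q = describe(query)

    # Stage 2: reasoning -- the language model never sees pixels,
    # only the generated descriptions.
    prompt = (
        "Positive examples:\n" + "\n".join(f"- {d}" for d in pos) +
        "\nNegative examples:\n" + "\n".join(f"- {d}" for d in neg) +
        f"\nQuery description:\n- {q}\n"
        "Does the query belong to the positive set? Answer yes or no."
    )
    return reason(prompt)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model access.
    describe = lambda img: f"an image showing {img}"
    reason = lambda p: "yes" if "dog" in p.split("Query")[1] else "no"
    print(componential_analysis(
        describe, reason,
        positives=["a dog running", "a dog sleeping"],
        negatives=["a cat sitting"],
        query="a dog jumping",
    ))  # -> yes
```

Because stage 2 operates purely on text, the same pipeline evaluates multi-image reasoning even when the underlying perception model accepts only one image at a time, which is the property the abstract highlights.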

Results

Task                           Dataset            Metric            Value  Model
Visual Reasoning               Winoground         Group Score       52     GPT-4o + CA
Visual Reasoning               Winoground         Image Score       58.5   GPT-4o + CA
Visual Reasoning               Winoground         Text Score        75.5   GPT-4o + CA
Visual Reasoning               Bongard-OpenWorld  2-Class Accuracy  93.6   Gemini 2.0 + CA
Visual Reasoning               Bongard-OpenWorld  2-Class Accuracy  92.8   GPT-4o + CA
Image Classification           Bongard-HOI        Avg. Accuracy     77.3   GPT-4o + CA
Image Classification           Bongard-HOI        Avg. Accuracy     74.5   Gemini 2.0 + CA
Few-Shot Image Classification  Bongard-HOI        Avg. Accuracy     77.3   GPT-4o + CA
Few-Shot Image Classification  Bongard-HOI        Avg. Accuracy     74.5   Gemini 2.0 + CA

Related Papers

Smart fault detection in satellite electrical power system (2025-07-18)
DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
Demographic-aware fine-grained classification of pediatric wrist fractures (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Trustworthy Tree-based Machine Learning by $MoS_2$ Flash-based Analog CAM with Inherent Soft Boundaries (2025-07-16)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)