Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task

Yifan Wu, Pengchuan Zhang, Wenhan Xiong, Barlas Oguz, James C. Gee, Yixin Nie

2023-11-15 | Visual Reasoning

Abstract

The study explores the effectiveness of the Chain-of-Thought approach, known for its proficiency in language tasks by breaking them down into sub-tasks and intermediate steps, in improving vision-language tasks that demand sophisticated perception and reasoning. We present the "Description then Decision" strategy, which is inspired by how humans process signals. This strategy significantly improves probing task performance by 50%, establishing the groundwork for future research on reasoning paradigms in complex vision-language tasks.
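The abstract names the "Description then Decision" strategy without giving its details, so the following is only a minimal sketch of one plausible reading: first ask the model to describe the image, then ask it to choose between candidate captions with that description in the prompt. The `model` callable, the prompt wording, and the answer parsing are all hypothetical and are not taken from the paper.

```python
import re
from typing import Callable, Sequence

# Hypothetical model interface (an assumption, not a real library API):
# takes a text prompt and an opaque image handle, returns the text reply.
VisionLanguageModel = Callable[[str, object], str]


def description_then_decision(
    model: VisionLanguageModel, image: object, options: Sequence[str]
) -> int:
    """Return the 0-based index of the option the model picks for the image."""
    # Stage 1 (description): ask only for what is visible in the image.
    description = model("Describe the image in detail.", image)

    # Stage 2 (decision): choose among the options, with the description
    # from stage 1 included in the prompt as an intermediate step.
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(options))
    prompt = (
        f"Image description:\n{description}\n\n"
        "Which option matches the image? Answer with the number only.\n"
        f"{numbered}"
    )
    reply = model(prompt, image)

    # Parse the first integer in the reply and validate its range.
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no option number found in reply: {reply!r}")
    choice = int(match.group()) - 1
    if not 0 <= choice < len(options):
        raise ValueError(f"option number out of range in reply: {reply!r}")
    return choice
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; a real evaluation would wrap an actual vision-language model behind the same signature.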

Results

Task | Dataset | Metric | Value | Model
Visual Reasoning | Winoground | Group Score | 58.75 | GPT-4V (CoT, pick b/w two options)
Visual Reasoning | Winoground | Image Score | 68.75 | GPT-4V (CoT, pick b/w two options)
Visual Reasoning | Winoground | Text Score | 75.25 | GPT-4V (CoT, pick b/w two options)
Visual Reasoning | Winoground | Group Score | 39.25 | GPT-4V (pick b/w two options)
Visual Reasoning | Winoground | Image Score | 46.25 | GPT-4V (pick b/w two options)
Visual Reasoning | Winoground | Text Score | 69.25 | GPT-4V (pick b/w two options)
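
As a quick check on the size of the gains, the sketch below computes the absolute and relative improvement of the CoT rows over the non-CoT rows, using only the values listed in the results above. The relative gain in Group Score comes out near the 50% figure in the abstract, although the abstract does not say which metric that figure refers to, so the correspondence is an assumption.

```python
# (baseline, with CoT) pairs copied from the results listed above.
scores = {
    "Group Score": (39.25, 58.75),
    "Image Score": (46.25, 68.75),
    "Text Score": (69.25, 75.25),
}


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


gains = {
    metric: (cot - base, relative_gain(base, cot))
    for metric, (base, cot) in scores.items()
}

for metric, (absolute, relative) in gains.items():
    print(f"{metric}: +{absolute:.2f} points, +{relative:.1f}% relative")

# Group Score: +19.50 points, +49.7% relative
# Image Score: +22.50 points, +48.6% relative
# Text Score: +6.00 points, +8.7% relative
```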

Related Papers

LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
Skywork-R1V3 Technical Report (2025-07-08)
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning (2025-07-08)
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning (2025-07-07)