The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task

Yifan Wu, Pengchuan Zhang, Wenhan Xiong, Barlas Oguz, James C. Gee, Yixin Nie

2023-11-15 · Visual Reasoning

Abstract

The study explores the effectiveness of the Chain-of-Thought approach, known for its proficiency in language tasks by breaking them down into sub-tasks and intermediate steps, in improving vision-language tasks that demand sophisticated perception and reasoning. We present the "Description then Decision" strategy, which is inspired by how humans process signals. This strategy significantly improves probing task performance by 50%, establishing the groundwork for future research on reasoning paradigms in complex vision-language tasks.
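The two-stage idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ask_model` callable, the prompt wording, and the `stub_model` stand-in are all hypothetical, and a real setup would route `ask_model` to an actual vision-language model.

```python
from typing import Callable

# Hypothetical interface (an assumption, not from the paper):
# (image_reference, prompt) -> model reply as text.
AskModel = Callable[[str, str], str]


def description_then_decision(
    ask_model: AskModel, image: str, option_a: str, option_b: str
) -> str:
    """Return "A" or "B" for the caption the model judges to match the image."""
    # Stage 1 (description): ask for a detailed description before posing any choice.
    description = ask_model(image, "Describe this image in detail.")

    # Stage 2 (decision): pose the two-option choice, conditioned on that description.
    # The prompt text here is illustrative only.
    decision_prompt = (
        f"Image description: {description}\n"
        f"Option A: {option_a}\n"
        f"Option B: {option_b}\n"
        "Which option matches the image? Answer with A or B."
    )
    reply = ask_model(image, decision_prompt).strip().upper()
    return "A" if reply.startswith("A") else "B"


def stub_model(image: str, prompt: str) -> str:
    """Canned stand-in for a real model so the sketch runs without network access."""
    if prompt.startswith("Describe"):
        return "An old person kisses a young person on the cheek."
    return "A"


choice = description_then_decision(
    stub_model,
    "example.png",
    "an old person kisses a young person",
    "a young person kisses an old person",
)
print(choice)
```

Separating the two calls keeps the perception step (description) from being biased by the answer options, which is the motivation the abstract gives for the strategy.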

Results

Task	Dataset	Metric	Value	Model
Visual Reasoning	Winoground	Group Score	58.75	GPT-4V (CoT, pick b/w two options)
Visual Reasoning	Winoground	Image Score	68.75	GPT-4V (CoT, pick b/w two options)
Visual Reasoning	Winoground	Text Score	75.25	GPT-4V (CoT, pick b/w two options)
Visual Reasoning	Winoground	Group Score	39.25	GPT-4V (pick b/w two options)
Visual Reasoning	Winoground	Image Score	46.25	GPT-4V (pick b/w two options)
Visual Reasoning	Winoground	Text Score	69.25	GPT-4V (pick b/w two options)
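
As a quick arithmetic check on the table above, the snippet below computes the absolute and relative gain of the CoT rows over the non-CoT rows for each metric. The scores are copied from the table; the variable and function names are illustrative. Reading the abstract's "50%" as a relative gain in Group Score is an assumption, though the numbers appear consistent with it.

```python
# Scores copied from the results table (GPT-4V, pick between two options).
baseline = {"Group Score": 39.25, "Image Score": 46.25, "Text Score": 69.25}
with_cot = {"Group Score": 58.75, "Image Score": 68.75, "Text Score": 75.25}


def gains(before: dict, after: dict) -> dict:
    """Map each metric to (absolute gain in points, relative gain in percent)."""
    return {
        metric: (
            after[metric] - before[metric],
            100.0 * (after[metric] - before[metric]) / before[metric],
        )
        for metric in before
    }


for metric, (absolute, relative) in gains(baseline, with_cot).items():
    print(f"{metric}: +{absolute:.2f} points ({relative:.1f}% relative)")
```

Under these numbers, Group Score rises by 19.50 points, roughly a 49.7% relative gain, while Text Score rises by 6.00 points, roughly 8.7%.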

Related Papers

LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
Skywork-R1V3 Technical Report (2025-07-08)
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning (2025-07-08)
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning (2025-07-07)