Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan YAO, Mingkai Chen, Jiebo Luo

2024-01-05 | Text Matching | Image Comprehension | Image to text | Visual Reasoning

Abstract

When exploring the development of Artificial General Intelligence (AGI), a critical task for multimodal models involves interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach based on multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about multi-image inputs based on the identified similarities and differences. Our experimental results showcase CoCoT's proficiency in enhancing the multi-image comprehension capabilities of large multimodal models.
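
The two-stage idea in the abstract (first contrast the images, then answer using that contrast) can be sketched as a small prompt builder. This is a minimal illustration only: the function name `build_cocot_prompt` and the exact prompt wording are assumptions made here, not the prompt used by the authors, and sending the prompt plus images to a particular model is left out.

```python
from typing import List


def build_cocot_prompt(question: str, num_images: int = 2) -> str:
    """Assemble a contrastive chain-of-thought style prompt.

    The wording is hypothetical; it only mirrors the order described in
    the abstract: similarities and differences first, then the answer.
    """
    if num_images < 2:
        raise ValueError("contrastive prompting needs at least two images")

    # Refer to the images by position so the model can keep them apart.
    image_refs = ", ".join(f"image {i + 1}" for i in range(num_images))

    steps: List[str] = [
        f"You are given {num_images} images: {image_refs}.",
        "Step 1: Describe the similarities among the images.",
        "Step 2: Describe the differences among the images.",
        "Step 3: Using those similarities and differences, answer the question.",
        f"Question: {question}",
    ]
    return "\n".join(steps)


if __name__ == "__main__":
    print(build_cocot_prompt("Which image matches the caption 'a dog chasing a cat'?"))
```

Keeping the comparison steps ahead of the question is the point of the sketch: the model is asked to separate the images explicitly before it commits to an answer, which targets the information-blending issue the abstract describes.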

Results

Task              Dataset     Metric       Value  Model
----------------  ----------  -----------  -----  --------------------
Visual Reasoning  Winoground  Group Score  50.75  MMICL + CoCoT
Visual Reasoning  Winoground  Image Score  52.5   MMICL + CoCoT
Visual Reasoning  Winoground  Text Score   64.25  MMICL + CoCoT
Visual Reasoning  Winoground  Group Score  44.5   GPT-4V + CoCoT
Visual Reasoning  Winoground  Image Score  49.5   GPT-4V + CoCoT
Visual Reasoning  Winoground  Text Score   58.5   GPT-4V + CoCoT
Visual Reasoning  Winoground  Group Score  41.5   OpenFlamingo + CoCoT
Visual Reasoning  Winoground  Image Score  55.25  OpenFlamingo + CoCoT
Visual Reasoning  Winoground  Text Score   58.25  OpenFlamingo + CoCoT
Visual Reasoning  Winoground  Group Score  37.75  GPT-4V
Visual Reasoning  Winoground  Image Score  42.5   GPT-4V
Visual Reasoning  Winoground  Text Score   54.5   GPT-4V
Visual Reasoning  Winoground  Group Score  47.5   MMICL + CCoT
Visual Reasoning  Winoground  Image Score  48     MMICL + CCoT
Visual Reasoning  Winoground  Text Score   51     MMICL + CCoT
Visual Reasoning  Winoground  Group Score  39     OpenFlamingo + DDCoT
Visual Reasoning  Winoground  Image Score  47.25  OpenFlamingo + DDCoT
Visual Reasoning  Winoground  Text Score   47.5   OpenFlamingo + DDCoT
Visual Reasoning  Winoground  Group Score  36.75  MMICL + DDCoT
Visual Reasoning  Winoground  Image Score  45     MMICL + DDCoT
Visual Reasoning  Winoground  Text Score   46.75  MMICL + DDCoT
Visual Reasoning  Winoground  Group Score  23.75  Gemini + DDCoT
Visual Reasoning  Winoground  Image Score  25     Gemini + DDCoT
Visual Reasoning  Winoground  Text Score   45     Gemini + DDCoT
Visual Reasoning  Winoground  Group Score  20     OpenFlamingo + CCoT
Visual Reasoning  Winoground  Image Score  27.5   OpenFlamingo + CCoT
Visual Reasoning  Winoground  Text Score   42.5   OpenFlamingo + CCoT
Visual Reasoning  Winoground  Group Score  27.75  Gemini + CoCoT
Visual Reasoning  Winoground  Image Score  32.5   Gemini + CoCoT
Visual Reasoning  Winoground  Text Score   40     Gemini + CoCoT
Visual Reasoning  Winoground  Group Score  33.25  OpenFlamingo
Visual Reasoning  Winoground  Image Score  41.25  OpenFlamingo
Visual Reasoning  Winoground  Text Score   39     OpenFlamingo
Visual Reasoning  Winoground  Group Score  25     Gemini
Visual Reasoning  Winoground  Image Score  26     Gemini
Visual Reasoning  Winoground  Text Score   30.75  Gemini
Visual Reasoning  Winoground  Group Score  20.75  Gemini + CCoT
Visual Reasoning  Winoground  Image Score  33     Gemini + CCoT
Visual Reasoning  Winoground  Text Score   22.5   Gemini + CCoT
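
For readers unfamiliar with the three metrics above, the sketch below follows the standard Winoground definitions: each example has two captions and two images, the text score checks that each image prefers its own caption, the image score checks that each caption prefers its own image, and the group score requires both. It assumes a numeric match score per caption-image pair; how the responses of prompted LMMs are mapped onto these decisions is not stated on this page, and the function names and toy numbers are hypothetical.

```python
from typing import Dict, Sequence

# scores[c][i] is a model's match score for caption c and image i,
# with c, i in {0, 1}. The correct pairings are (0, 0) and (1, 1).
Scores = Sequence[Sequence[float]]


def text_correct(s: Scores) -> bool:
    # For each image, the matching caption must beat the other caption.
    return s[0][0] > s[1][0] and s[1][1] > s[0][1]


def image_correct(s: Scores) -> bool:
    # For each caption, the matching image must beat the other image.
    return s[0][0] > s[0][1] and s[1][1] > s[1][0]


def group_correct(s: Scores) -> bool:
    # An example counts for the group score only if both checks pass.
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: Sequence[Scores]) -> Dict[str, float]:
    """Return text, image, and group scores as percentages."""
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one example")
    return {
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
    }


if __name__ == "__main__":
    toy_examples = [
        [[0.9, 0.2], [0.1, 0.8]],  # both checks pass
        [[0.5, 0.6], [0.4, 0.7]],  # text check passes, image check fails
    ]
    print(winoground_scores(toy_examples))
```

Because the group score is the conjunction of the other two, it can never exceed either of them for the same model, which matches every row group in the table.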

Related Papers

LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
Skywork-R1V3 Technical Report (2025-07-08)
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning (2025-07-08)
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning (2025-07-07)