Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Compositional Chain-of-Thought Prompting for Large Multimodal Models

Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig

2023-11-27 · CVPR 2024 · Visual Reasoning · Large Language Model · Language Modelling
Paper · PDF · Code (official)

Abstract

The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs)--a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, SG data requires SG annotations, which are expensive to collect and thus not easily scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs. Code: https://github.com/chancharikmitra/CCoT
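The abstract describes CCoT as a two-stage, zero-shot prompting procedure: first prompt the LMM to produce a scene graph for the image, then feed that generated SG back into a second prompt to answer the question. A minimal sketch of that pipeline, assuming a hypothetical `generate(image=..., prompt=...)` LMM inference call (the exact prompt wording here is illustrative, not the paper's):

```python
def ccot_answer(generate, image, question):
    """Two-stage Compositional Chain-of-Thought prompting (sketch)."""
    # Stage 1: ask the LMM for a scene graph of the image, conditioned on
    # the question so the SG focuses on relevant objects, attributes,
    # and relationships. No ground-truth SG annotations are needed.
    sg_prompt = (
        "For the provided image and question, generate a scene graph in "
        "JSON format that includes the objects, their attributes, and the "
        "relationships between them that are relevant to the question.\n"
        f"Question: {question}"
    )
    scene_graph = generate(image=image, prompt=sg_prompt)

    # Stage 2: include the generated SG as context and ask the question.
    answer_prompt = (
        f"Scene graph: {scene_graph}\n"
        f"Use the image and the scene graph to answer: {question}"
    )
    return generate(image=image, prompt=answer_prompt)
```

Because both stages are plain prompting calls, the method requires no fine-tuning and works with any off-the-shelf LMM.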

Results

Task             | Dataset    | Metric      | Value | Model
Visual Reasoning | Winoground | Group Score | 22.3  | LLaVA-1.5-CCoT
Visual Reasoning | Winoground | Image Score | 35.5  | LLaVA-1.5-CCoT
Visual Reasoning | Winoground | Text Score  | 42    | LLaVA-1.5-CCoT
Visual Reasoning | Winoground | Group Score | 20.1  | LLaVA-1.5
Visual Reasoning | Winoground | Image Score | 33.3  | LLaVA-1.5
Visual Reasoning | Winoground | Text Score  | 36    | LLaVA-1.5
Visual Reasoning | Winoground | Group Score | 12.3  | LLaVA-1.5-ZS-CoT
Visual Reasoning | Winoground | Image Score | 22.5  | LLaVA-1.5-ZS-CoT
Visual Reasoning | Winoground | Text Score  | 28    | LLaVA-1.5-ZS-CoT
Visual Reasoning | Winoground | Group Score | 8.3   | InstructBLIP-CCoT
Visual Reasoning | Winoground | Image Score | 21.3  | InstructBLIP-CCoT
Visual Reasoning | Winoground | Text Score  | 21    | InstructBLIP-CCoT
Visual Reasoning | Winoground | Group Score | 4     | InstructBLIP-ZS-CoT
Visual Reasoning | Winoground | Image Score | 16.3  | InstructBLIP-ZS-CoT
Visual Reasoning | Winoground | Text Score  | 9.3   | InstructBLIP-ZS-CoT
Visual Reasoning | Winoground | Group Score | 3.3   | InstructBLIP
Visual Reasoning | Winoground | Image Score | 11.5  | InstructBLIP
Visual Reasoning | Winoground | Text Score  | 7     | InstructBLIP
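The three Winoground metrics reported above come from the benchmark's pairwise setup (Thrush et al., 2022): each example has two captions and two images that differ only compositionally, and a model earns text/image/group credit only when it scores every matching pair above every mismatched one. A sketch of how those scores are computed, assuming a hypothetical `score(caption, image)` image-text match function:

```python
def winoground_metrics(examples, score):
    """Compute Winoground text, image, and group scores (as percentages).

    Each example is a tuple (c0, i0, c1, i1): caption c0 matches image i0,
    caption c1 matches image i1, and the cross pairings are mismatches.
    """
    text_ok = image_ok = group_ok = 0
    for c0, i0, c1, i1 in examples:
        # Text score: each image must prefer its matching caption.
        t = score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)
        # Image score: each caption must prefer its matching image.
        im = score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0)
        # Group score: both conditions must hold simultaneously.
        text_ok += t
        image_ok += im
        group_ok += t and im
    n = len(examples)
    return 100 * text_ok / n, 100 * image_ok / n, 100 * group_ok / n
```

The group score is the strictest of the three, which is why it is the lowest column for every model in the table.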

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
- LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
- GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)