Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Compositional Chain-of-Thought Prompting for Large Multimodal Models

Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig

2023-11-27 · CVPR 2024 · Visual Reasoning · Large Language Model · Language Modelling
Paper · PDF · Code (official)

Abstract

The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs)--a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, SG data requires SG annotations, which are expensive to collect and thus not easily scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs. Code: https://github.com/chancharikmitra/CCoT
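The abstract describes CCoT as a two-stage, zero-shot prompting procedure: first prompt the LMM to produce a scene graph for the image, then feed that generated SG back into a second prompt to answer the question. A minimal sketch of that pipeline, assuming a hypothetical `generate(image=..., prompt=...)` LMM inference call (the exact prompt wording here is illustrative, not the paper's):

```python
def ccot_answer(generate, image, question):
    """Two-stage Compositional Chain-of-Thought prompting (sketch)."""
    # Stage 1: ask the LMM for a scene graph of the image, conditioned on
    # the question so the SG focuses on relevant objects, attributes,
    # and relationships. No ground-truth SG annotations are needed.
    sg_prompt = (
        "For the provided image and question, generate a scene graph in "
        "JSON format that includes the objects, their attributes, and the "
        "relationships between them that are relevant to the question.\n"
        f"Question: {question}"
    )
    scene_graph = generate(image=image, prompt=sg_prompt)

    # Stage 2: include the generated SG as context and ask the question.
    answer_prompt = (
        f"Scene graph: {scene_graph}\n"
        f"Use the image and the scene graph to answer: {question}"
    )
    return generate(image=image, prompt=answer_prompt)
```

Because both stages are plain prompting calls, the method requires no fine-tuning and works with any off-the-shelf LMM.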

Results

Task             | Dataset    | Metric      | Value | Model
Visual Reasoning | Winoground | Group Score | 22.3  | LLaVA-1.5-CCoT
Visual Reasoning | Winoground | Image Score | 35.5  | LLaVA-1.5-CCoT
Visual Reasoning | Winoground | Text Score  | 42    | LLaVA-1.5-CCoT
Visual Reasoning | Winoground | Group Score | 20.1  | LLaVA-1.5
Visual Reasoning | Winoground | Image Score | 33.3  | LLaVA-1.5
Visual Reasoning | Winoground | Text Score  | 36    | LLaVA-1.5
Visual Reasoning | Winoground | Group Score | 12.3  | LLaVA-1.5-ZS-CoT
Visual Reasoning | Winoground | Image Score | 22.5  | LLaVA-1.5-ZS-CoT
Visual Reasoning | Winoground | Text Score  | 28    | LLaVA-1.5-ZS-CoT
Visual Reasoning | Winoground | Group Score | 8.3   | InstructBLIP-CCoT
Visual Reasoning | Winoground | Image Score | 21.3  | InstructBLIP-CCoT
Visual Reasoning | Winoground | Text Score  | 21    | InstructBLIP-CCoT
Visual Reasoning | Winoground | Group Score | 4     | InstructBLIP-ZS-CoT
Visual Reasoning | Winoground | Image Score | 16.3  | InstructBLIP-ZS-CoT
Visual Reasoning | Winoground | Text Score  | 9.3   | InstructBLIP-ZS-CoT
Visual Reasoning | Winoground | Group Score | 3.3   | InstructBLIP
Visual Reasoning | Winoground | Image Score | 11.5  | InstructBLIP
Visual Reasoning | Winoground | Text Score  | 7     | InstructBLIP
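The three Winoground metrics reported above come from the benchmark's pairwise setup (Thrush et al., 2022): each example has two captions and two images that differ only compositionally, and a model earns text/image/group credit only when it scores every matching pair above every mismatched one. A sketch of how those scores are computed, assuming a hypothetical `score(caption, image)` image-text match function:

```python
def winoground_metrics(examples, score):
    """Compute Winoground text, image, and group scores (as percentages).

    Each example is a tuple (c0, i0, c1, i1): caption c0 matches image i0,
    caption c1 matches image i1, and the cross pairings are mismatches.
    """
    text_ok = image_ok = group_ok = 0
    for c0, i0, c1, i1 in examples:
        # Text score: each image must prefer its matching caption.
        t = score(c0, i0) > score(c1, i0) and score(c1, i1) > score(c0, i1)
        # Image score: each caption must prefer its matching image.
        im = score(c0, i0) > score(c0, i1) and score(c1, i1) > score(c1, i0)
        # Group score: both conditions must hold simultaneously.
        text_ok += t
        image_ok += im
        group_ok += t and im
    n = len(examples)
    return 100 * text_ok / n, 100 * image_ok / n, 100 * group_ok / n
```

The group score is the strictest of the three, which is why it is the lowest column for every model in the table.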

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
- LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
- GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)