CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan YAO, Mingkai Chen, Jiebo Luo

2024-01-05
Text Matching, Image Comprehension, Image to text, Visual Reasoning

Abstract

In the development of Artificial General Intelligence (AGI), a critical task for models is interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The investigation focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason about and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach for multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about the multi-image inputs based on the identified similarities and differences. Our experimental results show that CoCoT improves the multi-image comprehension capabilities of large multimodal models.
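The two-stage idea described in the abstract (first contrast the images, then answer using that contrast) can be sketched as a simple prompt builder. This is a minimal illustration only: the function name `build_cocot_prompt`, its parameters, and the exact prompt wording are assumptions made for this example, not the prompt text used by the authors.

```python
def build_cocot_prompt(question: str, num_images: int = 2) -> str:
    """Assemble a contrastive two-step text prompt for a multi-image query.

    Hypothetical helper: the wording below is illustrative and is not
    taken from the paper.
    """
    if num_images < 2:
        raise ValueError("contrastive prompting needs at least two images")

    # Refer to the images by position so the model can keep them apart.
    labels = ", ".join(f"Image {i + 1}" for i in range(num_images))

    return (
        f"You are given {num_images} images: {labels}.\n"
        # Step 1: ask for an explicit comparison across the images.
        "Step 1: Describe the similarities and differences among these images.\n"
        # Step 2: ask for the answer, grounded in that comparison.
        "Step 2: Based on the similarities and differences you identified, "
        f"answer the following question: {question}"
    )


if __name__ == "__main__":
    print(build_cocot_prompt("Which image matches the caption 'a dog chasing a cat'?"))
```

The resulting string would be sent to the model together with the images themselves; how images are attached depends on the specific model interface and is left out of this sketch.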

Results

Task	Dataset	Metric	Value	Model
Visual Reasoning	Winoground	Group Score	50.75	MMICL + CoCoT
Visual Reasoning	Winoground	Image Score	52.5	MMICL + CoCoT
Visual Reasoning	Winoground	Text Score	64.25	MMICL + CoCoT
Visual Reasoning	Winoground	Group Score	44.5	GPT-4V + CoCoT
Visual Reasoning	Winoground	Image Score	49.5	GPT-4V + CoCoT
Visual Reasoning	Winoground	Text Score	58.5	GPT-4V + CoCoT
Visual Reasoning	Winoground	Group Score	41.5	OpenFlamingo + CoCoT
Visual Reasoning	Winoground	Image Score	55.25	OpenFlamingo + CoCoT
Visual Reasoning	Winoground	Text Score	58.25	OpenFlamingo + CoCoT
Visual Reasoning	Winoground	Group Score	37.75	GPT-4V
Visual Reasoning	Winoground	Image Score	42.5	GPT-4V
Visual Reasoning	Winoground	Text Score	54.5	GPT-4V
Visual Reasoning	Winoground	Group Score	47.5	MMICL + CCoT
Visual Reasoning	Winoground	Image Score	48	MMICL + CCoT
Visual Reasoning	Winoground	Text Score	51	MMICL + CCoT
Visual Reasoning	Winoground	Group Score	39	OpenFlamingo + DDCoT
Visual Reasoning	Winoground	Image Score	47.25	OpenFlamingo + DDCoT
Visual Reasoning	Winoground	Text Score	47.5	OpenFlamingo + DDCoT
Visual Reasoning	Winoground	Group Score	36.75	MMICL + DDCoT
Visual Reasoning	Winoground	Image Score	45	MMICL + DDCoT
Visual Reasoning	Winoground	Text Score	46.75	MMICL + DDCoT
Visual Reasoning	Winoground	Group Score	23.75	Gemini + DDCoT
Visual Reasoning	Winoground	Image Score	25	Gemini + DDCoT
Visual Reasoning	Winoground	Text Score	45	Gemini + DDCoT
Visual Reasoning	Winoground	Group Score	20	OpenFlamingo + CCoT
Visual Reasoning	Winoground	Image Score	27.5	OpenFlamingo + CCoT
Visual Reasoning	Winoground	Text Score	42.5	OpenFlamingo + CCoT
Visual Reasoning	Winoground	Group Score	27.75	Gemini + CoCoT
Visual Reasoning	Winoground	Image Score	32.5	Gemini + CoCoT
Visual Reasoning	Winoground	Text Score	40	Gemini + CoCoT
Visual Reasoning	Winoground	Group Score	33.25	OpenFlamingo
Visual Reasoning	Winoground	Image Score	41.25	OpenFlamingo
Visual Reasoning	Winoground	Text Score	39	OpenFlamingo
Visual Reasoning	Winoground	Group Score	25	Gemini
Visual Reasoning	Winoground	Image Score	26	Gemini
Visual Reasoning	Winoground	Text Score	30.75	Gemini
Visual Reasoning	Winoground	Group Score	20.75	Gemini + CCoT
Visual Reasoning	Winoground	Image Score	33	Gemini + CCoT
Visual Reasoning	Winoground	Text Score	22.5	Gemini + CCoT
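
The three metrics in the table can be read with the following sketch, which assumes the standard Winoground definitions: each example has two captions (C0, C1) and two images (I0, I1), and a model assigns a match score to every caption-image pair. The helper names and the tuple layout are assumptions for this illustration; how each LMM's answers are converted into match decisions for this leaderboard is not stated here.

```python
from typing import Dict, Sequence, Tuple

# One example: match scores in the order (C0-I0, C0-I1, C1-I0, C1-I1).
# This layout is an assumption made for the sketch.
Example = Tuple[float, float, float, float]


def text_correct(e: Example) -> bool:
    """For each image, the correct caption outscores the wrong caption."""
    c0i0, c0i1, c1i0, c1i1 = e
    return c0i0 > c1i0 and c1i1 > c0i1


def image_correct(e: Example) -> bool:
    """For each caption, the correct image outscores the wrong image."""
    c0i0, c0i1, c1i0, c1i1 = e
    return c0i0 > c0i1 and c1i1 > c1i0


def winoground_scores(examples: Sequence[Example]) -> Dict[str, float]:
    """Return text, image, and group scores as percentages."""
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one example")
    text = sum(text_correct(e) for e in examples)
    image = sum(image_correct(e) for e in examples)
    # Group score counts an example only if both directions are correct.
    group = sum(text_correct(e) and image_correct(e) for e in examples)
    return {
        "text": 100.0 * text / n,
        "image": 100.0 * image / n,
        "group": 100.0 * group / n,
    }
```

Under these definitions the group score can never exceed the text or image score for the same model, which is consistent with every row group in the table above.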
