Daoan Zhang, Junming Yang, Hanjia Lyu, Zijian Jin, Yuan YAO, Mingkai Chen, Jiebo Luo
In the pursuit of Artificial General Intelligence (AGI), a critical task for multimodal models is interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason about and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach for multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about the multi-image inputs based on the identified similarities and differences. Our experimental results show that CoCoT improves the multi-image comprehension capabilities of large multimodal models.
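
The two-step structure described above (compare first, then answer using the comparison) can be sketched as a small prompt builder. This is only an illustration: the function name `build_cocot_prompts` and the prompt wording are assumptions made here, not the authors' exact prompts, and the call to an actual LMM is deliberately left out.

```python
from typing import List


def build_cocot_prompts(question: str, num_images: int = 2) -> List[str]:
    """Return a hypothetical two-stage contrastive prompt.

    Stage 1 asks the model to compare the images.
    Stage 2 asks the question, conditioned on that comparison.
    The wording is illustrative, not taken from the paper.
    """
    if num_images < 2:
        raise ValueError("contrastive prompting needs at least two images")

    # Stage 1: elicit similarities and differences across the inputs.
    compare = (
        f"You are given {num_images} images. "
        "Describe the similarities and differences among them in detail."
    )
    # Stage 2: answer the detailed question using the comparison.
    answer = (
        "Based on the similarities and differences you identified, "
        f"answer the following question: {question}"
    )
    return [compare, answer]
```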
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Reasoning | Winoground | Group Score | 50.75 | MMICL + CoCoT |
| Visual Reasoning | Winoground | Image Score | 52.5 | MMICL + CoCoT |
| Visual Reasoning | Winoground | Text Score | 64.25 | MMICL + CoCoT |
| Visual Reasoning | Winoground | Group Score | 44.5 | GPT-4V + CoCoT |
| Visual Reasoning | Winoground | Image Score | 49.5 | GPT-4V + CoCoT |
| Visual Reasoning | Winoground | Text Score | 58.5 | GPT-4V + CoCoT |
| Visual Reasoning | Winoground | Group Score | 41.5 | OpenFlamingo + CoCoT |
| Visual Reasoning | Winoground | Image Score | 55.25 | OpenFlamingo + CoCoT |
| Visual Reasoning | Winoground | Text Score | 58.25 | OpenFlamingo + CoCoT |
| Visual Reasoning | Winoground | Group Score | 37.75 | GPT-4V |
| Visual Reasoning | Winoground | Image Score | 42.5 | GPT-4V |
| Visual Reasoning | Winoground | Text Score | 54.5 | GPT-4V |
| Visual Reasoning | Winoground | Group Score | 47.5 | MMICL + CCoT |
| Visual Reasoning | Winoground | Image Score | 48 | MMICL + CCoT |
| Visual Reasoning | Winoground | Text Score | 51 | MMICL + CCoT |
| Visual Reasoning | Winoground | Group Score | 39 | OpenFlamingo + DDCoT |
| Visual Reasoning | Winoground | Image Score | 47.25 | OpenFlamingo + DDCoT |
| Visual Reasoning | Winoground | Text Score | 47.5 | OpenFlamingo + DDCoT |
| Visual Reasoning | Winoground | Group Score | 36.75 | MMICL + DDCoT |
| Visual Reasoning | Winoground | Image Score | 45 | MMICL + DDCoT |
| Visual Reasoning | Winoground | Text Score | 46.75 | MMICL + DDCoT |
| Visual Reasoning | Winoground | Group Score | 23.75 | Gemini + DDCoT |
| Visual Reasoning | Winoground | Image Score | 25 | Gemini + DDCoT |
| Visual Reasoning | Winoground | Text Score | 45 | Gemini + DDCoT |
| Visual Reasoning | Winoground | Group Score | 20 | OpenFlamingo + CCoT |
| Visual Reasoning | Winoground | Image Score | 27.5 | OpenFlamingo + CCoT |
| Visual Reasoning | Winoground | Text Score | 42.5 | OpenFlamingo + CCoT |
| Visual Reasoning | Winoground | Group Score | 27.75 | Gemini + CoCoT |
| Visual Reasoning | Winoground | Image Score | 32.5 | Gemini + CoCoT |
| Visual Reasoning | Winoground | Text Score | 40 | Gemini + CoCoT |
| Visual Reasoning | Winoground | Group Score | 33.25 | OpenFlamingo |
| Visual Reasoning | Winoground | Image Score | 41.25 | OpenFlamingo |
| Visual Reasoning | Winoground | Text Score | 39 | OpenFlamingo |
| Visual Reasoning | Winoground | Group Score | 25 | Gemini |
| Visual Reasoning | Winoground | Image Score | 26 | Gemini |
| Visual Reasoning | Winoground | Text Score | 30.75 | Gemini |
| Visual Reasoning | Winoground | Group Score | 20.75 | Gemini + CCoT |
| Visual Reasoning | Winoground | Image Score | 33 | Gemini + CCoT |
| Visual Reasoning | Winoground | Text Score | 22.5 | Gemini + CCoT |
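
For readers unfamiliar with the three metrics in the table, the sketch below shows how Winoground text, image, and group scores are conventionally defined for one example consisting of two captions (C0, C1) and two images (I0, I1). It assumes the model exposes a numeric caption-image match score; the dictionary keys such as `c0_i0` are hypothetical names chosen here. Generative LMMs like those in the table are typically evaluated by asking them to choose directly, so this is an illustration of the metric definitions rather than the evaluation code behind these results.

```python
from typing import Dict, List


def winoground_scores(examples: List[Dict[str, float]]) -> Dict[str, float]:
    """Compute text, image, and group scores as percentages.

    Each example maps hypothetical keys "c0_i0", "c0_i1", "c1_i0", "c1_i1"
    to the model's match score for that caption-image pair.
    """
    if not examples:
        raise ValueError("need at least one example")

    text = image = group = 0
    for s in examples:
        # Text score: for each image, the correct caption must score higher.
        text_ok = s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]
        # Image score: for each caption, the correct image must score higher.
        image_ok = s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]
        text += text_ok
        image += image_ok
        # Group score: both conditions must hold for the same example.
        group += text_ok and image_ok

    n = len(examples)
    return {
        "text": 100.0 * text / n,
        "image": 100.0 * image / n,
        "group": 100.0 * group / n,
    }
```

Because the group score requires both conditions on the same example, it can never exceed the lower of the text and image scores, which matches the pattern in every row group above.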