Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

An Examination of the Compositionality of Large Generative Vision-Language Models

Teli Ma, Rong Li, Junwei Liang

2023-08-21 · Visual Reasoning

Abstract

With the success of Large Language Models (LLMs), many Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. However, the performance of GVLMs in multimodal compositional reasoning remains under-explored. In this paper, we examine both the evaluation metrics (VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs. We identify the syntactical bias in current benchmarks, which is exploited by the linguistic capability of GVLMs. The bias renders VisualGPTScore an insufficient metric for assessing GVLMs. To combat this, we first introduce a SyntaxBias Score, leveraging LLMs to quantify such bias for mitigation. A challenging new task is subsequently added to evaluate the robustness of GVLMs against inherent inclination toward syntactical correctness. Using the bias-mitigated datasets and the new task, we propose a novel benchmark, namely SyntActically DE-biased benchmark (SADE). Our study provides an unbiased benchmark for the compositionality of GVLMs, facilitating future research in this direction (Code and dataset are available at https://github.com/TeleeMa/SADE).

Results

Task              Dataset     Metric       Value  Model
Visual Reasoning  Winoground  Group Score  10.5   LLaVA-7B (GPTScore)
Visual Reasoning  Winoground  Image Score  17     LLaVA-7B (GPTScore)
Visual Reasoning  Winoground  Text Score   25.5   LLaVA-7B (GPTScore)
Visual Reasoning  Winoground  Group Score  11.5   MiniGPT-4-7B (GPTScore)
Visual Reasoning  Winoground  Image Score  21.75  MiniGPT-4-7B (GPTScore)
Visual Reasoning  Winoground  Text Score   24.5   MiniGPT-4-7B (GPTScore)
Visual Reasoning  Winoground  Group Score  9.5    MiniGPT-4-7B (VisualGPTScore)
Visual Reasoning  Winoground  Image Score  18     MiniGPT-4-7B (VisualGPTScore)
Visual Reasoning  Winoground  Text Score   23.25  MiniGPT-4-7B (VisualGPTScore)
Visual Reasoning  Winoground  Group Score  2.75   MiniGPT-4-7B (BERTScore)
Visual Reasoning  Winoground  Image Score  8      MiniGPT-4-7B (BERTScore)
Visual Reasoning  Winoground  Text Score   14     MiniGPT-4-7B (BERTScore)
Visual Reasoning  Winoground  Group Score  2.25   LLaVA-7B (BERTScore)
Visual Reasoning  Winoground  Image Score  5.25   LLaVA-7B (BERTScore)
Visual Reasoning  Winoground  Text Score   13.5   LLaVA-7B (BERTScore)
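
For readers unfamiliar with the three metrics in the results above, the following is a minimal sketch of how Winoground-style Text, Image, and Group Scores are usually derived from any image-caption scoring function (GPTScore, VisualGPTScore, BERTScore, and so on). Each example pairs two captions with two images. The dictionary keys such as "c0_i1" and the function names are hypothetical and chosen for illustration only; this is not the authors' evaluation code, and the toy numbers are not taken from the paper.

```python
from typing import Dict, Iterable

# One Winoground-style example: a score for each of the four caption/image
# pairings. Key "c0_i1" means caption 0 scored against image 1.
# (Hypothetical naming, used only for this sketch.)
PairScores = Dict[str, float]


def text_correct(s: PairScores) -> bool:
    """For each image, the matching caption must outscore the other caption."""
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]


def image_correct(s: PairScores) -> bool:
    """For each caption, the matching image must outscore the other image."""
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]


def group_correct(s: PairScores) -> bool:
    """An example counts for the group score only if both checks pass."""
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: Iterable[PairScores]) -> Dict[str, float]:
    """Return the percentage of examples passing each check."""
    examples = list(examples)
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one example")
    return {
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
    }


# Toy scores (made up for illustration, not results from the paper).
toy_examples = [
    {"c0_i0": 0.9, "c0_i1": 0.1, "c1_i0": 0.2, "c1_i1": 0.8},  # text and image pass
    {"c0_i0": 0.9, "c0_i1": 0.5, "c1_i0": 0.2, "c1_i1": 0.4},  # image passes only
    {"c0_i0": 0.1, "c0_i1": 0.9, "c1_i0": 0.9, "c1_i1": 0.1},  # nothing passes
    {"c0_i0": 0.5, "c0_i1": 0.6, "c1_i0": 0.4, "c1_i1": 0.7},  # text passes only
]

if __name__ == "__main__":
    print(winoground_scores(toy_examples))
    # {'text': 50.0, 'image': 50.0, 'group': 25.0}
```

Because the group check requires both the text and the image check to pass on the same example, the Group Score can never exceed either of the other two, which matches the ordering seen in every row group of the results.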

Related Papers

LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
Skywork-R1V3 Technical Report (2025-07-08)
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning (2025-07-08)
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning (2025-07-07)