Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Prompting Large Vision-Language Models for Compositional Reasoning

Timothy Ossowski, Ming Jiang, Junjie Hu

2024-01-20 | Visual Reasoning | Retrieval

Abstract

Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still face challenges in effectively matching images and texts with similar visio-linguistic compositionality, as evidenced by their performance on the recent Winoground dataset. In this paper, we argue that this limitation stems from two factors: the use of single vector representations for complex multimodal data, and the absence of step-by-step reasoning in these embedding-based methods. To address this issue, we make an exploratory step using a novel generative method that prompts large vision-language models (e.g., GPT-4) to depict images and perform compositional reasoning. Our method outperforms other embedding-based methods on the Winoground dataset, and obtains further improvement of up to 10% accuracy when enhanced with the optimal description.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Reasoning | Winoground | Group Score | 18.2 | KeyComp* (GPT-4) |
| Visual Reasoning | Winoground | Image Score | 28.7 | KeyComp* (GPT-4) |
| Visual Reasoning | Winoground | Text Score | 43.5 | KeyComp* (GPT-4) |
| Visual Reasoning | Winoground | Group Score | 17.4 | KeyComp* (GPT-3.5) |
| Visual Reasoning | Winoground | Image Score | 27.8 | KeyComp* (GPT-3.5) |
| Visual Reasoning | Winoground | Text Score | 42.7 | KeyComp* (GPT-3.5) |
| Visual Reasoning | Winoground | Group Score | 12.4 | KeyComp (GPT-3.5) |
| Visual Reasoning | Winoground | Image Score | 24.6 | KeyComp (GPT-3.5) |
| Visual Reasoning | Winoground | Text Score | 30.3 | KeyComp (GPT-3.5) |
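
This page does not define the Text, Image, and Group scores. The sketch below assumes the standard Winoground definitions: each example has two captions (c0, c1) and two images (i0, i1), and a model is credited only when it ranks the matching pairs above the mismatched ones. The `Example` type, the `score` callable, and the function names are hypothetical and chosen for illustration; how a generative method turns its output into a comparable match score is not described here, so treat `score` as a placeholder.

```python
from typing import Callable, Iterable, NamedTuple


class Example(NamedTuple):
    """One Winoground-style item: caption c0 matches image i0, c1 matches i1."""
    c0: str
    c1: str
    i0: str
    i1: str


# Hypothetical scorer: score(caption, image) -> higher means better match.
Score = Callable[[str, str], float]


def text_correct(score: Score, ex: Example) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return (score(ex.c0, ex.i0) > score(ex.c1, ex.i0)
            and score(ex.c1, ex.i1) > score(ex.c0, ex.i1))


def image_correct(score: Score, ex: Example) -> bool:
    # For each caption, the matching image must outscore the other image.
    return (score(ex.c0, ex.i0) > score(ex.c0, ex.i1)
            and score(ex.c1, ex.i1) > score(ex.c1, ex.i0))


def winoground_scores(score: Score, examples: Iterable[Example]) -> dict:
    """Return text, image, and group accuracy as percentages."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    text = image = group = 0
    for ex in examples:
        t = text_correct(score, ex)
        i = image_correct(score, ex)
        text += t
        image += i
        group += t and i  # group credit requires both conditions
    n = len(examples)
    return {
        "text": 100.0 * text / n,
        "image": 100.0 * image / n,
        "group": 100.0 * group / n,
    }
```

Because group credit requires both the text and image conditions on the same example, the Group Score can never exceed either of the other two, which is consistent with the ordering of the values in the table above.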

Related Papers

LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)