Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs

Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, Amir Globerson

2023-05-10 | Scene Understanding, Visual Reasoning

Abstract

Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS) performance in a variety of tasks. However, recent works have shown that even the best VLMs struggle to capture aspects of compositional scene understanding, such as object attributes, relations, and action states. In contrast, obtaining structured annotations, such as scene graphs (SGs), that could improve these models is time-consuming and costly, and thus cannot be used on a large scale. Here we ask whether small SG datasets can provide sufficient information for enhancing structured understanding of pretrained VLMs. We show that it is indeed possible to improve VLMs when learning from SGs by integrating components that incorporate structured information into both visual and textual representations. For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions that highlight different compositional aspects of the scene. Our method improves the performance of several popular VLMs on multiple VL datasets with only a mild degradation in ZS capabilities.
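
To make the textual side of the abstract concrete, the following is a minimal sketch of how a scene graph could be rendered into a fine-grained caption. It is an illustration only, not the authors' implementation: the `SGObject`, `SGRelation`, and `SceneGraph` classes and the functions `graph_to_caption` and `swap_negative` are hypothetical names introduced here. The subject/object swap shown at the end is one common way to build a compositional hard negative and is not necessarily what the paper does.

```python
from dataclasses import dataclass, field


@dataclass
class SGObject:
    """Hypothetical scene-graph node: an object name plus its attributes."""
    name: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class SGRelation:
    """Hypothetical scene-graph edge; subject/object index into SceneGraph.objects."""
    subject: int
    predicate: str
    object: int


@dataclass
class SceneGraph:
    objects: list[SGObject]
    relations: list[SGRelation]


def describe(obj: SGObject) -> str:
    """Render an object with its attributes in front, e.g. 'brown dog'."""
    return " ".join([*obj.attributes, obj.name])


def graph_to_caption(graph: SceneGraph) -> str:
    """Join one clause per relation, then list objects not used in any relation."""
    covered: set[int] = set()
    clauses: list[str] = []
    for rel in graph.relations:
        subj = describe(graph.objects[rel.subject])
        obj = describe(graph.objects[rel.object])
        clauses.append(f"{subj} {rel.predicate} {obj}")
        covered.update((rel.subject, rel.object))
    for idx, node in enumerate(graph.objects):
        if idx not in covered:
            clauses.append(describe(node))
    return " and ".join(clauses)


def swap_negative(graph: SceneGraph, rel_index: int = 0) -> str:
    """Caption with subject and object of one relation swapped (assumed negative scheme)."""
    rel = graph.relations[rel_index]
    relations = list(graph.relations)
    relations[rel_index] = SGRelation(rel.object, rel.predicate, rel.subject)
    return graph_to_caption(SceneGraph(graph.objects, relations))


graph = SceneGraph(
    objects=[SGObject("dog", ["brown"]), SGObject("frisbee", ["red"])],
    relations=[SGRelation(0, "catching", 1)],
)
positive = graph_to_caption(graph)   # "brown dog catching red frisbee"
negative = swap_negative(graph)      # "red frisbee catching brown dog"
```

The positive and negative captions use exactly the same words in a different arrangement, which is the kind of compositional distinction (attributes, relations) the abstract says VLMs struggle with.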

Results

Task              Dataset     Metric       Value  Model
Visual Reasoning  Winoground  Group Score  23.3   BLIP2 (SGVL)
Visual Reasoning  Winoground  Image Score  28.5   BLIP2 (SGVL)
Visual Reasoning  Winoground  Text Score   42.8   BLIP2 (SGVL)
Visual Reasoning  Winoground  Group Score  21.5   BLIP (SGVL)
Visual Reasoning  Winoground  Image Score  27.3   BLIP (SGVL)
Visual Reasoning  Winoground  Text Score   42.8   BLIP (SGVL)
Visual Reasoning  Winoground  Group Score  18.5   NegBLIP
Visual Reasoning  Winoground  Image Score  24     NegBLIP
Visual Reasoning  Winoground  Text Score   42.5   NegBLIP
Visual Reasoning  Winoground  Group Score  19     BLIP2
Visual Reasoning  Winoground  Image Score  23.8   BLIP2
Visual Reasoning  Winoground  Text Score   42     BLIP2
Visual Reasoning  Winoground  Group Score  20.5   NegBLIP2
Visual Reasoning  Winoground  Image Score  26     NegBLIP2
Visual Reasoning  Winoground  Text Score   41.5   NegBLIP2
Visual Reasoning  Winoground  Group Score  19     BLIP (+Graph Text, +Graph Neg)
Visual Reasoning  Winoground  Image Score  25.5   BLIP (+Graph Text, +Graph Neg)
Visual Reasoning  Winoground  Text Score   40.5   BLIP (+Graph Text, +Graph Neg)
Visual Reasoning  Winoground  Group Score  16.5   BLIP (+Graph Text)
Visual Reasoning  Winoground  Image Score  20.5   BLIP (+Graph Text)
Visual Reasoning  Winoground  Text Score   40.3   BLIP (+Graph Text)
Visual Reasoning  Winoground  Group Score  15     BLIP
Visual Reasoning  Winoground  Image Score  19.2   BLIP
Visual Reasoning  Winoground  Text Score   39     BLIP
Visual Reasoning  Winoground  Group Score  9.8    CLIP (SGVL)
Visual Reasoning  Winoground  Image Score  14     CLIP (SGVL)
Visual Reasoning  Winoground  Text Score   32     CLIP (SGVL)
Visual Reasoning  Winoground  Group Score  8      NegCLIP
Visual Reasoning  Winoground  Image Score  10.5   NegCLIP
Visual Reasoning  Winoground  Text Score   29.5   NegCLIP
Visual Reasoning  Winoground  Group Score  13     LLaVA
Visual Reasoning  Winoground  Image Score  25     LLaVA
Visual Reasoning  Winoground  Text Score   24.8   LLaVA
Visual Reasoning  Winoground  Group Score  9.5    MiniGPT-4
Visual Reasoning  Winoground  Image Score  18     MiniGPT-4
Visual Reasoning  Winoground  Text Score   23.3   MiniGPT-4
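
The Text, Image, and Group scores reported above follow the standard Winoground definitions: each example has two captions (c0, c1) and two images (i0, i1), and a model is scored on whether its similarity scores match captions to the right images. The sketch below computes the three metrics from precomputed similarities. The `PairScores` container, the function names, and the toy numbers are assumptions made for illustration; how each listed model produces its similarity scores is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairScores:
    """Model similarities for one example: cX_iY = score(caption X, image Y)."""
    c0_i0: float
    c0_i1: float
    c1_i0: float
    c1_i1: float


def text_correct(s: PairScores) -> bool:
    """For each image, the matching caption must outscore the other caption."""
    return s.c0_i0 > s.c1_i0 and s.c1_i1 > s.c0_i1


def image_correct(s: PairScores) -> bool:
    """For each caption, the matching image must outscore the other image."""
    return s.c0_i0 > s.c0_i1 and s.c1_i1 > s.c1_i0


def group_correct(s: PairScores) -> bool:
    """An example counts for the group score only if both checks pass."""
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: list[PairScores]) -> dict[str, float]:
    """Return the percentage of examples passing each criterion."""
    if not examples:
        raise ValueError("need at least one example")
    n = len(examples)
    return {
        "text": 100.0 * sum(text_correct(e) for e in examples) / n,
        "image": 100.0 * sum(image_correct(e) for e in examples) / n,
        "group": 100.0 * sum(group_correct(e) for e in examples) / n,
    }


# Toy similarities (made up for illustration, not from any model in the table).
toy = [
    PairScores(0.9, 0.2, 0.3, 0.8),  # text and image both correct
    PairScores(0.6, 0.7, 0.5, 0.8),  # text correct, image wrong
    PairScores(0.4, 0.3, 0.5, 0.6),  # image correct, text wrong
    PairScores(0.1, 0.2, 0.3, 0.1),  # both wrong
]
scores = winoground_scores(toy)
# scores == {"text": 50.0, "image": 50.0, "group": 25.0}
```

Because the group criterion requires both checks to pass on the same example, the group score can never exceed the text or image score, which is consistent with every model row in the table.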

Related Papers

Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation (2025-07-15)
Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander (2025-07-15)
Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis (2025-07-15)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)