TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Location-Free Scene Graph Generation

Location-Free Scene Graph Generation

Ege Özsoy, Felix Holm, Mahdi Saleh, Tobias Czempiel, Chantal Pellegrini, Nassir Navab, Benjamin Busam

2023-03-20Scene Graph GenerationQuestion AnsweringGraph GenerationVisual Question AnsweringImage Retrieval
PaperPDFCode(official)

Abstract

Scene Graph Generation (SGG) is a visual understanding task, aiming to describe a scene as a graph of entities and their relationships with each other. Existing works rely on location labels in form of bounding boxes or segmentation masks, increasing annotation costs and limiting dataset expansion. Recognizing that many applications do not require location data, we break this dependency and introduce location-free scene graph generation (LF-SGG). This new task aims at predicting instances of entities, as well as their relationships, without the explicit calculation of their spatial localization. To objectively evaluate the task, the predicted and ground truth scene graphs need to be compared. We solve this NP-hard problem through an efficient branching algorithm. Additionally, we design the first LF-SGG method, Pix2SG, using autoregressive sequence modeling. We demonstrate the effectiveness of our method on three scene graph generation datasets as well as two downstream tasks, image retrieval and visual question answering, and show that our approach is competitive to existing methods while not relying on location cues.

Results

TaskDatasetMetricValueModel
Scene Parsing4D-ORF10.9Pix2SG
2D Semantic Segmentation4D-ORF10.9Pix2SG
Scene Graph Generation4D-ORF10.9Pix2SG

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL2025-07-17Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering2025-07-17Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It2025-07-17City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning2025-07-17NGTM: Substructure-based Neural Graph Topic Model for Interpretable Graph Generation2025-07-17FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval2025-07-17MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval2025-07-17Describe Anything Model for Visual Question Answering on Text-rich Images2025-07-16