Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Visual Relationship Detection with Language Priors

Cewu Lu, Ranjay Krishna, Michael Bernstein, Li Fei-Fei

2016-07-31 · Content-Based Image Retrieval · Visual Relationship Detection · Word Embeddings · Retrieval · Relationship Detection · Image Retrieval

Abstract

Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. "man riding bicycle" and "man pushing bicycle"). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. "man" and "bicycle") and predicates (e.g. "riding" and "pushing") independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them together to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship. Our model can scale to predict thousands of types of relationships from a few examples. Additionally, we localize the objects in the predicted relationships as bounding boxes in the image. We further demonstrate that understanding relationships can improve content based image retrieval.
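The core idea above — score a relationship by combining a visual module's confidence with a language prior computed from word embeddings of the subject and object — can be sketched as follows. This is an illustrative simplification, not the paper's exact formulation: the function and parameter names are invented, the embeddings stand in for word2vec vectors, and the product is just one simple way to combine the two signals.

```python
import numpy as np

def language_prior(subj_vec, obj_vec, predicate_weights, predicate_bias):
    # Project the concatenated subject/object word embeddings onto a
    # learned, predicate-specific direction: a linear model of how
    # plausible this predicate is for this object pair.
    x = np.concatenate([subj_vec, obj_vec])
    return float(predicate_weights @ x + predicate_bias)

def relationship_score(visual_score, subj_vec, obj_vec, predicate_weights, predicate_bias):
    # Combine the visual module's confidence with the language prior.
    # A product lets a strong prior ("man riding bicycle") boost a
    # weak visual detection, and vice versa.
    return visual_score * language_prior(subj_vec, obj_vec,
                                         predicate_weights, predicate_bias)
```

Because the prior depends only on word embeddings, semantically similar relationships (e.g. "man riding horse" and "person riding horse") receive similar prior scores even if one was never seen in training — which is what lets the model generalize to rare relationships.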

Results

The same six results are reported under four task categories: Scene Parsing, Visual Relationship Detection, Scene Understanding, and 2D Semantic Segmentation.

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| VRD Relationship Detection | R@100 | 14.7 | Lu et al. (2016) |
| VRD Relationship Detection | R@50 | 13.86 | Lu et al. (2016) |
| VRD Predicate Detection | R@100 | 47.87 | Lu et al. (2016) |
| VRD Predicate Detection | R@50 | 47.87 | Lu et al. (2016) |
| VRD Phrase Detection | R@100 | 17.03 | Lu et al. (2016) |
| VRD Phrase Detection | R@50 | 16.17 | Lu et al. (2016) |

One further result is listed under Scene Parsing, 2D Semantic Segmentation, and Scene Graph Generation:

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| VRD | Recall@50 | 18.16 | VRD |
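The R@50 and R@100 (Recall@K) numbers in the tables above measure the fraction of ground-truth relationships that appear among a model's top-K ranked predictions per image. A minimal sketch, with relationships simplified to (subject, predicate, object) triples — the actual benchmark additionally requires the predicted bounding boxes to overlap the ground truth:

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triples found in the top-k predictions.

    ranked_predictions: list of (subject, predicate, object) triples,
    sorted by model confidence, highest first.
    ground_truth: list of annotated (subject, predicate, object) triples.
    """
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for gt in ground_truth if gt in top_k)
    return hits / len(ground_truth)
```

Recall (rather than precision) is used because the VRD dataset's relationship annotations are incomplete: a plausible predicted relationship that happens to be unannotated should not be penalized.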

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)