Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Weakly-supervised learning of visual relations

Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

2017-07-29 · ICCV 2017
Tasks: Clustering, Retrieval, Zero-Shot Learning
Paper · PDF

Abstract

This paper introduces a novel approach for modeling visual relations between pairs of objects. We call a relation a triplet of the form (subject, predicate, object), where the predicate is typically a preposition (e.g. 'under', 'in front of') or a verb ('hold', 'ride') that links a pair of objects (subject, object). Learning such relations is challenging because the objects have different spatial configurations and appearances depending on the relation in which they occur. Another major challenge is the difficulty of obtaining annotations, especially at box level, for all possible triplets, which complicates both learning and evaluation. The contributions of this paper are threefold. First, we design strong yet flexible visual features that encode the appearance and spatial configuration of pairs of objects. Second, we propose a weakly-supervised discriminative clustering model to learn relations from image-level labels only. Third, we introduce a new challenging dataset of unusual relations (UnRel), together with an exhaustive annotation, which enables accurate evaluation of visual relation retrieval. We show experimentally that our model achieves state-of-the-art results on the Visual Relationship Detection dataset, significantly improving performance on previously unseen relations (zero-shot learning), and confirm this observation on our newly introduced UnRel dataset.
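The first contribution, features that encode the spatial configuration of an object pair, can be pictured with a small sketch. The encoding below (relative offsets, log scale ratios, area ratio, overlap) is a hypothetical simplification for illustration only, not the paper's exact feature; the box format `(x, y, w, h)` and the helper `iou` are assumptions.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (aw * ah + bw * bh - inter)

def spatial_feature(box_s, box_o):
    """Hypothetical 6-d spatial encoding of a (subject, object) box pair,
    normalized by the subject box so the feature is translation- and
    scale-invariant. Illustrative only, not the paper's definition."""
    xs, ys, ws, hs = box_s
    xo, yo, wo, ho = box_o
    return np.array([
        (xo - xs) / ws,          # horizontal offset of object vs. subject
        (yo - ys) / hs,          # vertical offset
        np.log(wo / ws),         # relative width (log scale)
        np.log(ho / hs),         # relative height (log scale)
        (wo * ho) / (ws * hs),   # area ratio
        iou(box_s, box_o),       # overlap between the two boxes
    ])

# Two equal-size boxes side by side: pure horizontal offset, no overlap.
print(spatial_feature((0, 0, 10, 10), (10, 0, 10, 10)))
# → [1. 0. 0. 0. 1. 0.]
```

Concatenating such a spatial vector with per-box appearance features gives one joint descriptor per candidate pair, which a predicate classifier can then score.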

Results

The same six results are listed under four task tags (Scene Parsing, Visual Relationship Detection, Scene Understanding, 2D Semantic Segmentation); the underlying numbers are identical:

| Dataset | Metric | Value | Model |
|---|---|---|---|
| VRD Relationship Detection | R@100 | 17.1 | Peyre et al. 2017 |
| VRD Relationship Detection | R@50 | 15.8 | Peyre et al. 2017 |
| VRD Predicate Detection | R@100 | 52.6 | Peyre et al. 2017 |
| VRD Predicate Detection | R@50 | 52.6 | Peyre et al. 2017 |
| VRD Phrase Detection | R@100 | 19.5 | Peyre et al. 2017 |
| VRD Phrase Detection | R@50 | 17.9 | Peyre et al. 2017 |
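The R@K metric in the table is Recall@K: the fraction of ground-truth relation triplets that appear among the top-K highest-scoring predictions for an image (on VRD it is then averaged over test images). A minimal sketch, assuming triplets are represented as plain tuples:

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth triplets found in the top-k predictions.

    ranked_predictions: list of (subject, predicate, object) tuples,
        sorted by decreasing model score.
    ground_truth: list of annotated (subject, predicate, object) tuples.
    """
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triplet in ground_truth if triplet in top_k)
    return hits / len(ground_truth)

# Toy example: only one of three annotated triplets ranks in the top 2.
preds = [("person", "ride", "horse"),
         ("person", "wear", "hat"),
         ("horse", "on", "grass")]
gt = [("person", "ride", "horse"),
      ("horse", "on", "grass"),
      ("person", "hold", "rope")]
print(recall_at_k(preds, gt, 2))  # → 0.333...
```

Recall is used rather than precision because the VRD annotation is incomplete: a correct predicted triplet that happens to be unannotated should not count as an error.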
