
What and When to Look?: Temporal Span Proposal Network for Video Relation Detection

Sangmin Woo, Junhyug Noh, Kangil Kim

2021-07-15 · Tasks: Video Visual Relation Detection, Video Visual Relation Tagging
Paper · PDF · Code (official)

Abstract

Identifying relations between objects is central to understanding a scene. While several works have been proposed for relation modeling in the image domain, progress in the video domain has been constrained by the challenging dynamics of spatio-temporal interactions (e.g., between which objects is there an interaction? when does a relation start and end?). To date, two representative methods have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. We first point out the limitations of these methods and then propose a novel approach named Temporal Span Proposal Network (TSPN). TSPN tells what to look at: it sparsifies the relation search space by scoring the relationness of each object pair, i.e., how probable it is that a relation exists. TSPN tells when to look: it simultaneously predicts the start-end timestamps (i.e., temporal spans) and categories of all possible relations by exploiting the full video context. These two designs enable a win-win scenario: TSPN accelerates training by 2x or more over existing methods while achieving competitive performance on two VidVRD benchmarks (ImageNet-VidVRD and VidOR). Moreover, comprehensive ablation experiments demonstrate the effectiveness of our approach. Code is available at https://github.com/sangminwoo/Temporal-Span-Proposal-Network-VidVRD.
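The abstract describes two components: a relationness score that prunes the object-pair search space ("what to look at") and a head that jointly predicts predicate categories and start-end timestamps from full-video context ("when to look"). A minimal PyTorch-style sketch of these two ideas is given below; module names, feature dimensions, and the predicate count are illustrative assumptions, not the official implementation (see the linked repository for that).

```python
# Illustrative sketch of the two TSPN ideas from the abstract; all sizes and
# module names are assumptions, not the authors' code.
import torch
import torch.nn as nn


class RelationnessScorer(nn.Module):
    """'What to look at': scores how likely a relation exists for an object pair."""

    def __init__(self, feat_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, subj_feat, obj_feat):
        # subj_feat, obj_feat: (num_pairs, feat_dim) pooled tracklet features
        pair = torch.cat([subj_feat, obj_feat], dim=-1)
        return torch.sigmoid(self.mlp(pair)).squeeze(-1)  # (num_pairs,)


class TemporalSpanHead(nn.Module):
    """'When to look': predicts predicate categories and start/end timestamps."""

    def __init__(self, feat_dim: int = 1024, num_predicates: int = 132):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_predicates)        # relation categories
        self.span = nn.Linear(feat_dim, 2 * num_predicates)   # (start, end) per category

    def forward(self, pair_feat):
        # pair_feat: (num_pairs, feat_dim) features pooled over the full video
        logits = self.cls(pair_feat)                                # (P, C)
        spans = self.span(pair_feat).view(-1, logits.size(-1), 2)   # (P, C, 2)
        return logits, torch.sigmoid(spans)  # spans normalized to [0, 1] of video length


# Usage idea: keep only pairs whose relationness exceeds a threshold, then
# predict categories and temporal spans for the surviving pairs.
```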

Results

Task                          | Dataset         | Metric     | Value | Model
Scene Parsing                 | ImageNet-VidVRD | Recall@100 | 14.13 | TSPN
Scene Parsing                 | ImageNet-VidVRD | Recall@50  | 11.56 | TSPN
Scene Parsing                 | ImageNet-VidVRD | mAP        | 18.9  | TSPN
Scene Parsing                 | VidOR           | Recall@100 | 10.71 | TSPN
Scene Parsing                 | VidOR           | Recall@50  | 9.33  | TSPN
Scene Parsing                 | VidOR           | mAP        | 7.61  | TSPN
Visual Relationship Detection | ImageNet-VidVRD | Recall@100 | 14.13 | TSPN
Visual Relationship Detection | ImageNet-VidVRD | Recall@50  | 11.56 | TSPN
Visual Relationship Detection | ImageNet-VidVRD | mAP        | 18.9  | TSPN
Visual Relationship Detection | VidOR           | Recall@100 | 10.71 | TSPN
Visual Relationship Detection | VidOR           | Recall@50  | 9.33  | TSPN
Visual Relationship Detection | VidOR           | mAP        | 7.61  | TSPN
Scene Understanding           | ImageNet-VidVRD | Recall@100 | 14.13 | TSPN
Scene Understanding           | ImageNet-VidVRD | Recall@50  | 11.56 | TSPN
Scene Understanding           | ImageNet-VidVRD | mAP        | 18.9  | TSPN
Scene Understanding           | VidOR           | Recall@100 | 10.71 | TSPN
Scene Understanding           | VidOR           | Recall@50  | 9.33  | TSPN
Scene Understanding           | VidOR           | mAP        | 7.61  | TSPN
2D Semantic Segmentation      | ImageNet-VidVRD | Recall@100 | 14.13 | TSPN
2D Semantic Segmentation      | ImageNet-VidVRD | Recall@50  | 11.56 | TSPN
2D Semantic Segmentation      | ImageNet-VidVRD | mAP        | 18.9  | TSPN
2D Semantic Segmentation      | VidOR           | Recall@100 | 10.71 | TSPN
2D Semantic Segmentation      | VidOR           | Recall@50  | 9.33  | TSPN
2D Semantic Segmentation      | VidOR           | mAP        | 7.61  | TSPN
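
In the standard VidVRD protocol, Recall@K is the fraction of ground-truth relation triplets recovered among the top-K predictions for a video. A minimal sketch of that computation follows, simplified to exact triplet matching; the full benchmark additionally requires spatio-temporal overlap between predicted and ground-truth tracklets, and this simplification is an assumption for illustration only.

```python
# Minimal Recall@K sketch for relation triplets, assuming exact
# (subject, predicate, object) matching; the real VidVRD evaluation also
# checks tracklet overlap before counting a prediction as a hit.
def recall_at_k(predictions, ground_truth, k):
    """predictions: list of (score, (subject, predicate, object));
    ground_truth: set of (subject, predicate, object) triplets."""
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    matched = {triplet for _, triplet in top_k if triplet in ground_truth}
    return len(matched) / max(len(ground_truth), 1)


# Example: 1 of 2 ground-truth triplets recovered in the top-2 predictions -> 0.5
preds = [(0.9, ("dog", "chase", "cat")), (0.7, ("dog", "run_past", "car"))]
gt = {("dog", "chase", "cat"), ("cat", "sit_on", "sofa")}
print(recall_at_k(preds, gt, k=2))  # 0.5
```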

Related Papers

OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment (2025-03-12)
VrdONE: One-stage Video Visual Relation Detection (2024-08-18)
SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos (2024-04-06)
In Defense of Clip-based Video Relation Detection (2023-07-18)
Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection (2023-02-01)
Meta Spatio-Temporal Debiasing for Video Scene Graph Generation (2022-07-23)
VRDFormer: End-to-End Video Visual Relation Detection With Transformers (2022-01-01)
Video Relation Detection via Tracklet based Visual Transformer (2021-08-19)