
What and When to Look?: Temporal Span Proposal Network for Video Relation Detection

Sangmin Woo, Junhyug Noh, Kangil Kim

2021-07-15 · Tasks: Video Visual Relation Detection, Video Visual Relation Tagging
Paper · PDF · Code (official)

Abstract

Identifying relations between objects is central to understanding a scene. While several works have been proposed for relation modeling in the image domain, progress in the video domain has been constrained by the challenging dynamics of spatio-temporal interactions (e.g., between which objects is there an interaction? when does a relation start and end?). To date, two representative methods have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. We first point out the limitations of these methods and then propose a novel approach named Temporal Span Proposal Network (TSPN). TSPN tells what to look at: it sparsifies the relation search space by scoring the relationness of each object pair, i.e., how probable it is that a relation exists. TSPN tells when to look: it simultaneously predicts the start-end timestamps (i.e., temporal spans) and categories of all possible relations by exploiting the full video context. These two designs enable a win-win scenario: TSPN accelerates training by 2x or more over existing methods while achieving competitive performance on two VidVRD benchmarks (ImageNet-VidVRD and VidOR). Moreover, comprehensive ablation experiments demonstrate the effectiveness of our approach. Code is available at https://github.com/sangminwoo/Temporal-Span-Proposal-Network-VidVRD.
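The abstract describes two components: a relationness score that prunes the object-pair search space ("what to look at") and a head that jointly predicts predicate categories and start-end timestamps from full-video context ("when to look"). A minimal PyTorch-style sketch of these two ideas is given below; module names, feature dimensions, and the predicate count are illustrative assumptions, not the official implementation (see the linked repository for that).

```python
# Illustrative sketch of the two TSPN ideas from the abstract; all sizes and
# module names are assumptions, not the authors' code.
import torch
import torch.nn as nn


class RelationnessScorer(nn.Module):
    """'What to look at': scores how likely a relation exists for an object pair."""

    def __init__(self, feat_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, subj_feat, obj_feat):
        # subj_feat, obj_feat: (num_pairs, feat_dim) pooled tracklet features
        pair = torch.cat([subj_feat, obj_feat], dim=-1)
        return torch.sigmoid(self.mlp(pair)).squeeze(-1)  # (num_pairs,)


class TemporalSpanHead(nn.Module):
    """'When to look': predicts predicate categories and start/end timestamps."""

    def __init__(self, feat_dim: int = 1024, num_predicates: int = 132):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_predicates)        # relation categories
        self.span = nn.Linear(feat_dim, 2 * num_predicates)   # (start, end) per category

    def forward(self, pair_feat):
        # pair_feat: (num_pairs, feat_dim) features pooled over the full video
        logits = self.cls(pair_feat)                                # (P, C)
        spans = self.span(pair_feat).view(-1, logits.size(-1), 2)   # (P, C, 2)
        return logits, torch.sigmoid(spans)  # spans normalized to [0, 1] of video length


# Usage idea: keep only pairs whose relationness exceeds a threshold, then
# predict categories and temporal spans for the surviving pairs.
```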

Results

Task                          | Dataset         | Metric     | Value | Model
Scene Parsing                 | ImageNet-VidVRD | Recall@100 | 14.13 | TSPN
Scene Parsing                 | ImageNet-VidVRD | Recall@50  | 11.56 | TSPN
Scene Parsing                 | ImageNet-VidVRD | mAP        | 18.9  | TSPN
Scene Parsing                 | VidOR           | Recall@100 | 10.71 | TSPN
Scene Parsing                 | VidOR           | Recall@50  | 9.33  | TSPN
Scene Parsing                 | VidOR           | mAP        | 7.61  | TSPN
Visual Relationship Detection | ImageNet-VidVRD | Recall@100 | 14.13 | TSPN
Visual Relationship Detection | ImageNet-VidVRD | Recall@50  | 11.56 | TSPN
Visual Relationship Detection | ImageNet-VidVRD | mAP        | 18.9  | TSPN
Visual Relationship Detection | VidOR           | Recall@100 | 10.71 | TSPN
Visual Relationship Detection | VidOR           | Recall@50  | 9.33  | TSPN
Visual Relationship Detection | VidOR           | mAP        | 7.61  | TSPN
Scene Understanding           | ImageNet-VidVRD | Recall@100 | 14.13 | TSPN
Scene Understanding           | ImageNet-VidVRD | Recall@50  | 11.56 | TSPN
Scene Understanding           | ImageNet-VidVRD | mAP        | 18.9  | TSPN
Scene Understanding           | VidOR           | Recall@100 | 10.71 | TSPN
Scene Understanding           | VidOR           | Recall@50  | 9.33  | TSPN
Scene Understanding           | VidOR           | mAP        | 7.61  | TSPN
2D Semantic Segmentation      | ImageNet-VidVRD | Recall@100 | 14.13 | TSPN
2D Semantic Segmentation      | ImageNet-VidVRD | Recall@50  | 11.56 | TSPN
2D Semantic Segmentation      | ImageNet-VidVRD | mAP        | 18.9  | TSPN
2D Semantic Segmentation      | VidOR           | Recall@100 | 10.71 | TSPN
2D Semantic Segmentation      | VidOR           | Recall@50  | 9.33  | TSPN
2D Semantic Segmentation      | VidOR           | mAP        | 7.61  | TSPN
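
In the standard VidVRD protocol, Recall@K is the fraction of ground-truth relation triplets recovered among the top-K predictions for a video. A minimal sketch of that computation follows, simplified to exact triplet matching; the full benchmark additionally requires spatio-temporal overlap between predicted and ground-truth tracklets, and this simplification is an assumption for illustration only.

```python
# Minimal Recall@K sketch for relation triplets, assuming exact
# (subject, predicate, object) matching; the real VidVRD evaluation also
# checks tracklet overlap before counting a prediction as a hit.
def recall_at_k(predictions, ground_truth, k):
    """predictions: list of (score, (subject, predicate, object));
    ground_truth: set of (subject, predicate, object) triplets."""
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    matched = {triplet for _, triplet in top_k if triplet in ground_truth}
    return len(matched) / max(len(ground_truth), 1)


# Example: 1 of 2 ground-truth triplets recovered in the top-2 predictions -> 0.5
preds = [(0.9, ("dog", "chase", "cat")), (0.7, ("dog", "run_past", "car"))]
gt = {("dog", "chase", "cat"), ("cat", "sit_on", "sofa")}
print(recall_at_k(preds, gt, k=2))  # 0.5
```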

Related Papers

OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment (2025-03-12)
VrdONE: One-stage Video Visual Relation Detection (2024-08-18)
SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos (2024-04-06)
In Defense of Clip-based Video Relation Detection (2023-07-18)
Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection (2023-02-01)
Meta Spatio-Temporal Debiasing for Video Scene Graph Generation (2022-07-23)
VRDFormer: End-to-End Video Visual Relation Detection With Transformers (2022-01-01)
Video Relation Detection via Tracklet based Visual Transformer (2021-08-19)