ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

Meng-Jiun Chiou, Chun-Yu Liao, Li-Wei Wang, Roger Zimmermann, Jiashi Feng

2021-05-25

Action Detection, Human-Object Interaction Detection, Human-Object Interaction Anticipation, Spatio-Temporal Action Localization


Abstract

Detecting human-object interactions (HOI) is an important step toward comprehensive visual understanding by machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, it is unlikely even for humans to guess temporal-related HOIs (e.g., opening/closing a door) from a single video frame, where the neighboring frames play an essential role. However, conventional HOI methods operating on only static images have been used to predict temporal-related interactions, which is essentially guessing without temporal context and may lead to sub-optimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline does not work on video-based HOIs due to a feature-inconsistency issue. We then propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI) utilizing temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features. We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.

Results

Task	Dataset	Metric	Value	Model
Human-Object Interaction Detection	VidHOI	Detection: Full (mAP@0.5)	7.61	STTRAN
Human-Object Interaction Detection	VidHOI	Detection: Non-Rare (mAP@0.5)	13.18	STTRAN
Human-Object Interaction Detection	VidHOI	Detection: Rare (mAP@0.5)	3.33	STTRAN
Human-Object Interaction Detection	VidHOI	Oracle: Full (mAP@0.5)	28.32	STTRAN
Human-Object Interaction Detection	VidHOI	Oracle: Non-Rare (mAP@0.5)	42.08	STTRAN
Human-Object Interaction Detection	VidHOI	Oracle: Rare (mAP@0.5)	17.74	STTRAN
Human-Object Interaction Anticipation	VidHOI	Person-wise Top5: t=1 (mAP@0.5)	29.09	STTRAN
Human-Object Interaction Anticipation	VidHOI	Person-wise Top5: t=3 (mAP@0.5)	27.59	STTRAN
Human-Object Interaction Anticipation	VidHOI	Person-wise Top5: t=5 (mAP@0.5)	27.32	STTRAN
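
The mAP@0.5 metrics above rely on an intersection-over-union (IoU) threshold of 0.5. The following is a minimal sketch of how a predicted HOI triplet might be matched to a ground-truth triplet under the common HOI detection convention, where both the human box and the object box must overlap their ground-truth counterparts with IoU of at least 0.5 and the labels must agree. The dictionary keys (`human_box`, `object_box`, `object_class`, `predicate`) and the function names are hypothetical and chosen for illustration; this is not the VidHOI evaluation code, and the exact matching rules used by the benchmark are an assumption here.

```python
from typing import Dict, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates, with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Dict, gt: Dict, thresh: float = 0.5) -> bool:
    """Assumed matching rule: labels agree and both boxes reach the IoU threshold."""
    return (
        pred["predicate"] == gt["predicate"]
        and pred["object_class"] == gt["object_class"]
        and iou(pred["human_box"], gt["human_box"]) >= thresh
        and iou(pred["object_box"], gt["object_box"]) >= thresh
    )
```

Under this assumed rule, a prediction with the correct predicate but a poorly localized object box does not count as a match. Average precision is then computed per interaction class from the ranked matches and averaged to obtain mAP.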
