Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

Meng-Jiun Chiou, Chun-Yu Liao, Li-Wei Wang, Roger Zimmermann, Jiashi Feng

2021-05-25 · Action Detection · Human-Object Interaction Detection · Human-Object Interaction Anticipation · Spatio-Temporal Action Localization

Paper · PDF · Code (official)

Abstract

Detecting human-object interactions (HOI) is an important step toward comprehensive visual understanding by machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, it is unlikely even for humans to guess temporally related HOIs (e.g., opening/closing a door) from a single video frame, where the neighboring frames play an essential role. Nevertheless, conventional HOI methods that operate only on static images have been used to predict temporally related interactions; this amounts to guessing without temporal context and may lead to sub-optimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline does not work on video-based HOIs due to a feature-inconsistency issue. We then propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI), which utilizes temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features. We construct a new video HOI benchmark dubbed VidHOI, on which our proposed approach serves as a solid baseline.
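The "correctly-localized visual features" idea can be made concrete. Below is a minimal PyTorch sketch (not the authors' code; the function name, tensor shapes, and temporal average-pooling are illustrative assumptions): rather than pooling a spatio-temporal feature volume with a single keyframe box, which causes the feature-inconsistency issue the abstract mentions when people and objects move across frames, RoIAlign is applied per frame along a tracked trajectory and the per-frame features are pooled over time.

```python
# Minimal sketch of trajectory-aligned feature pooling, assuming PyTorch +
# torchvision. Names and shapes are illustrative, not the authors' API.
import torch
from torchvision.ops import roi_align

def trajectory_pooled_feature(feature_maps, trajectory, output_size=7):
    """feature_maps: (T, C, H, W) per-frame backbone features.
    trajectory:   (T, 4) per-frame box (x1, y1, x2, y2) in feature-map coords.
    Returns a (C,) feature vector pooled along the trajectory."""
    per_frame = []
    for t in range(feature_maps.shape[0]):
        # roi_align expects rows of (batch_index, x1, y1, x2, y2).
        box = torch.cat([torch.zeros(1, 1), trajectory[t].view(1, 4)], dim=1)
        feat = roi_align(feature_maps[t : t + 1], box, output_size)  # (1, C, 7, 7)
        per_frame.append(feat.mean(dim=(2, 3)))                      # (1, C)
    # Averaging over time is one simple pooling choice; per the abstract,
    # ST-HOI combines such trajectory features with masking pose features.
    return torch.stack(per_frame, dim=0).mean(dim=0).squeeze(0)      # (C,)

# Example: 8 frames of 256-channel features and a slowly drifting track.
feats = torch.randn(8, 256, 32, 32)
track = torch.tensor([[4.0 + t, 4.0, 20.0 + t, 28.0] for t in range(8)])
vec = trajectory_pooled_feature(feats, track)  # shape: (256,)
```

The contrast with the naive baseline is that a single keyframe box applied to temporally pooled features would sample the wrong spatial locations for every frame in which the target has moved.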

Results

Task | Dataset | Metric | Value | Model
Human-Object Interaction Detection | VidHOI | Detection: Full (mAP@0.5) | 7.61 | STTRAN
Human-Object Interaction Detection | VidHOI | Detection: Non-Rare (mAP@0.5) | 13.18 | STTRAN
Human-Object Interaction Detection | VidHOI | Detection: Rare (mAP@0.5) | 3.33 | STTRAN
Human-Object Interaction Detection | VidHOI | Oracle: Full (mAP@0.5) | 28.32 | STTRAN
Human-Object Interaction Detection | VidHOI | Oracle: Non-Rare (mAP@0.5) | 42.08 | STTRAN
Human-Object Interaction Detection | VidHOI | Oracle: Rare (mAP@0.5) | 17.74 | STTRAN
Human-Object Interaction Anticipation | VidHOI | Person-wise Top-5: t=1 (mAP@0.5) | 29.09 | STTRAN
Human-Object Interaction Anticipation | VidHOI | Person-wise Top-5: t=3 (mAP@0.5) | 27.59 | STTRAN
Human-Object Interaction Anticipation | VidHOI | Person-wise Top-5: t=5 (mAP@0.5) | 27.32 | STTRAN
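For reference on the mAP@0.5 figures above: under the standard HOI detection protocol (which the VidHOI benchmark is assumed here to follow), a predicted (human, interaction, object) triplet counts as a true positive only when both the human and the object box overlap their ground-truth counterparts with IoU ≥ 0.5 and the interaction class matches; average precision is then computed per interaction category and averaged over the Full, Rare, or Non-Rare category subsets. For the anticipation rows, t denotes the horizon into the future at which interactions are predicted. A minimal sketch of the matching rule:

```python
# Hedged sketch of the common HOI triplet-matching rule behind mAP@0.5;
# assumed, not confirmed, to match VidHOI's exact protocol.
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def triplet_matches(pred, gt, thr=0.5):
    """pred/gt: dicts with 'human_box', 'object_box', 'interaction'.
    True positive iff class matches and both boxes clear the IoU threshold."""
    return (pred["interaction"] == gt["interaction"]
            and iou(pred["human_box"], gt["human_box"]) >= thr
            and iou(pred["object_box"], gt["object_box"]) >= thr)
```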

Related Papers

RoHOI: Robustness Benchmark for Human-Object Interaction Detection (2025-07-12)
Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection (2025-07-09)
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment (2025-06-25)
MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans (2025-06-25)
HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions (2025-06-24)
On the Robustness of Human-Object Interaction Detection against Distribution Shift (2025-06-22)
Distributed Activity Detection for Cell-Free Hybrid Near-Far Field Communications (2025-06-17)