Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Human-Object Interaction Prediction in Videos through Gaze Following

Zhifan Ni, Esteve Valls Mascaró, Hyemin Ahn, Dongheui Lee

2023-06-06 · Human-Object Interaction Detection · Human-Object Interaction Anticipation

Paper · PDF · Code (official)

Abstract

Understanding human-object interactions (HOIs) in a video is essential to fully comprehending a visual scene. This line of research has been addressed by detecting HOIs from images and, more recently, from videos. However, the video-based HOI anticipation task in the third-person view remains understudied. In this paper, we design a framework to detect current HOIs and anticipate future HOIs in videos. We propose to leverage human gaze information, since people often fixate on an object before interacting with it. These gaze features, together with the scene context and the visual appearances of human-object pairs, are fused through a spatio-temporal transformer. To evaluate the model on the HOI anticipation task in a multi-person scenario, we propose a set of person-wise multi-label metrics. Our model is trained and validated on the VidHOI dataset, which contains videos capturing daily life and is currently the largest video HOI dataset. Experimental results on the HOI detection task show that our approach improves over the baseline by a large relative margin of 36.3%. Moreover, we conduct an extensive ablation study to demonstrate the effectiveness of our modifications and extensions to the spatio-temporal transformer. Our code is publicly available at https://github.com/nizhf/hoi-prediction-gaze-transformer.
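The abstract describes fusing three per-frame feature streams (human-object pair appearance, scene context, and gaze) through a spatio-temporal transformer. The following is a minimal NumPy sketch of that fusion idea, assuming a single-head self-attention layer that treats the three streams as tokens per frame; the actual ST-GAZE architecture (layer counts, residuals, temporal attention) is defined in the paper's code release and will differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_features(appearance, context, gaze, w_q, w_k, w_v):
    """Fuse appearance, scene-context, and gaze features with one
    self-attention layer (single head, no residuals -- a sketch only).

    appearance, context, gaze: (T, d) arrays, one row per frame.
    w_q, w_k, w_v: (d, d) projection matrices.
    Returns a (T, 3, d) array of attended tokens per frame.
    """
    # Stack the three feature streams as three tokens per frame: (T, 3, d)
    tokens = np.stack([appearance, context, gaze], axis=1)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    d = q.shape[-1]
    # Scaled dot-product attention across the three tokens of each frame
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d), axis=-1)
    return attn @ v

# Toy example with random features (T = 4 frames, d = 8 dims)
T, d = 4, 8
rng = np.random.default_rng(0)
app, ctx, gaze = (rng.normal(size=(T, d)) for _ in range(3))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
fused = fuse_features(app, ctx, gaze, w_q, w_k, w_v)
print(fused.shape)  # (4, 3, 8)
```

In the paper's setting the fused tokens would then be aggregated over time by the temporal half of the transformer before interaction classification; here they are simply returned per frame.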

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Human-Object Interaction Detection | VidHOI | Detection: Full (mAP@0.5) | 10.4 | ST-GAZE |
| Human-Object Interaction Detection | VidHOI | Detection: Non-Rare (mAP@0.5) | 16.83 | ST-GAZE |
| Human-Object Interaction Detection | VidHOI | Detection: Rare (mAP@0.5) | 5.46 | ST-GAZE |
| Human-Object Interaction Detection | VidHOI | Oracle: Full (mAP@0.5) | 38.61 | ST-GAZE |
| Human-Object Interaction Detection | VidHOI | Oracle: Non-Rare (mAP@0.5) | 52.44 | ST-GAZE |
| Human-Object Interaction Detection | VidHOI | Oracle: Rare (mAP@0.5) | 27.99 | ST-GAZE |
| Human-Object Interaction Anticipation | VidHOI | Person-wise Top-5: t=1 (mAP@0.5) | 37.59 | ST-GAZE |
| Human-Object Interaction Anticipation | VidHOI | Person-wise Top-5: t=3 (mAP@0.5) | 33.14 | ST-GAZE |
| Human-Object Interaction Anticipation | VidHOI | Person-wise Top-5: t=5 (mAP@0.5) | 32.75 | ST-GAZE |

Related Papers

RoHOI: Robustness Benchmark for Human-Object Interaction Detection (2025-07-12)
Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection (2025-07-09)
VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions (2025-06-24)
On the Robustness of Human-Object Interaction Detection against Distribution Shift (2025-06-22)
Egocentric Human-Object Interaction Detection: A New Benchmark and Method (2025-06-17)
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions (2025-06-11)
HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation (2025-06-10)