Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Interaction Region Visual Transformer for Egocentric Action Anticipation

Debaditya Roy, Ramanathan Rajendiran, Basura Fernando

Published: 2022-11-25 · IEEE WACV 2024
Tasks: Human-Object Interaction Detection, Action Anticipation
Links: Paper · PDF · Code

Abstract

Human-object interaction is one of the most important visual cues, and we propose a novel way to represent human-object interactions for egocentric action anticipation. Our transformer variant models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. From these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual-transformer-based methods, including object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), outperforming the second-best model by 3.3% on mean top-5 recall.
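The cross-attention idea in the abstract, hand tokens querying object tokens so that hand representations are refined by candidate interaction objects, can be sketched as single-head scaled dot-product cross-attention. This is a minimal illustration under assumed shapes (2 hand tokens, 5 object tokens, 64-d features), not the paper's implementation, which is multi-headed and operates inside a video transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: (H, d) hand tokens; keys/values: (O, d) object tokens.
    Returns (H, d) hand tokens refined by the objects they attend to.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (H, O) hand-object affinities
    weights = softmax(scores, axis=-1)       # each hand attends over all objects
    return weights @ values                  # (H, d) refined hand tokens

# Illustrative token counts and dimensionality (assumptions).
rng = np.random.default_rng(0)
hand_tokens = rng.standard_normal((2, 64))
object_tokens = rng.standard_normal((5, 64))
refined = cross_attention(hand_tokens, object_tokens, object_tokens)
print(refined.shape)  # (2, 64)
```

In the paper's pipeline, tokens refined this way are further combined with contextual information (Trajectory Cross-Attention) before being pooled into the video representation.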

Results

Task                          Dataset                     Metric           Value   Model
Activity Recognition          EPIC-KITCHENS-100 (test)    Recall@5         23.75   InAViT
Activity Recognition          EGTEA                       Top-1 Accuracy   67.8    InAViT
Activity Recognition          EPIC-KITCHENS-100           Recall@5         25.89   InAViT
Action Recognition            EPIC-KITCHENS-100 (test)    Recall@5         23.75   InAViT
Action Recognition            EGTEA                       Top-1 Accuracy   67.8    InAViT
Action Recognition            EPIC-KITCHENS-100           Recall@5         25.89   InAViT
Action Anticipation           EPIC-KITCHENS-100 (test)    Recall@5         23.75   InAViT
Action Anticipation           EGTEA                       Top-1 Accuracy   67.8    InAViT
Action Anticipation           EPIC-KITCHENS-100           Recall@5         25.89   InAViT
Action Recognition In Videos  EPIC-KITCHENS-100 (test)    Recall@5         23.75   InAViT
Action Recognition In Videos  EGTEA                       Top-1 Accuracy   67.8    InAViT
Action Recognition In Videos  EPIC-KITCHENS-100           Recall@5         25.89   InAViT
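The leaderboard metric above, mean top-5 recall, averages per-class top-5 recall over classes: for each class, the fraction of its instances whose true label appears among the five highest-scoring predictions. The sketch below is a simplified version under assumed inputs (a score matrix and label vector); the official EK100 evaluation adds further breakdowns not modeled here.

```python
import numpy as np

def mean_topk_recall(scores, labels, k=5):
    """Class-mean top-k recall.

    scores: (N, C) prediction scores; labels: (N,) true class indices.
    Per class: fraction of its instances with the true label in the
    top-k scored classes; result is the mean over classes present in
    `labels`. Simplified sketch, not the official EK100 evaluation code.
    """
    topk = np.argsort(-scores, axis=1)[:, :k]    # (N, k) top-k class indices
    hit = (topk == labels[:, None]).any(axis=1)  # per-instance top-k hit
    classes = np.unique(labels)
    per_class = [hit[labels == c].mean() for c in classes]
    return float(np.mean(per_class))

# Toy example: both instances have their true label in the top 5.
scores = np.array([[0.1, 0.5, 0.2, 0.9, 0.3, 0.4, 0.05, 0.6],
                   [0.9, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]])
labels = np.array([3, 0])
print(mean_topk_recall(scores, labels, k=5))  # 1.0
```

Averaging per class rather than per instance keeps rare classes from being swamped by frequent ones, which matters on long-tailed datasets like EK100.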

Related Papers

- RoHOI: Robustness Benchmark for Human-Object Interaction Detection (2025-07-12)
- Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection (2025-07-09)
- VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
- HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions (2025-06-24)
- On the Robustness of Human-Object Interaction Detection against Distribution Shift (2025-06-22)
- Egocentric Human-Object Interaction Detection: A New Benchmark and Method (2025-06-17)
- InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions (2025-06-11)
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (2025-06-11)