Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Interaction Region Visual Transformer for Egocentric Action Anticipation

Debaditya Roy, Ramanathan Rajendiran, Basura Fernando

Published: 2022-11-25 · IEEE WACV 2024
Tasks: Human-Object Interaction Detection, Action Anticipation
Links: Paper · PDF · Code

Abstract

Human-object interaction is one of the most important visual cues, and we propose a novel way to represent human-object interactions for egocentric action anticipation. Our transformer variant models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. From these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual-transformer-based methods, including object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), outperforming the second-best model by 3.3% on mean top-5 recall.
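The cross-attention idea in the abstract, hand tokens querying object tokens so that hand representations are refined by candidate interaction objects, can be sketched as single-head scaled dot-product cross-attention. This is a minimal illustration under assumed shapes (2 hand tokens, 5 object tokens, 64-d features), not the paper's implementation, which is multi-headed and operates inside a video transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: (H, d) hand tokens; keys/values: (O, d) object tokens.
    Returns (H, d) hand tokens refined by the objects they attend to.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (H, O) hand-object affinities
    weights = softmax(scores, axis=-1)       # each hand attends over all objects
    return weights @ values                  # (H, d) refined hand tokens

# Illustrative token counts and dimensionality (assumptions).
rng = np.random.default_rng(0)
hand_tokens = rng.standard_normal((2, 64))
object_tokens = rng.standard_normal((5, 64))
refined = cross_attention(hand_tokens, object_tokens, object_tokens)
print(refined.shape)  # (2, 64)
```

In the paper's pipeline, tokens refined this way are further combined with contextual information (Trajectory Cross-Attention) before being pooled into the video representation.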

Results

Task                          Dataset                     Metric           Value   Model
Activity Recognition          EPIC-KITCHENS-100 (test)    Recall@5         23.75   InAViT
Activity Recognition          EGTEA                       Top-1 Accuracy   67.8    InAViT
Activity Recognition          EPIC-KITCHENS-100           Recall@5         25.89   InAViT
Action Recognition            EPIC-KITCHENS-100 (test)    Recall@5         23.75   InAViT
Action Recognition            EGTEA                       Top-1 Accuracy   67.8    InAViT
Action Recognition            EPIC-KITCHENS-100           Recall@5         25.89   InAViT
Action Anticipation           EPIC-KITCHENS-100 (test)    Recall@5         23.75   InAViT
Action Anticipation           EGTEA                       Top-1 Accuracy   67.8    InAViT
Action Anticipation           EPIC-KITCHENS-100           Recall@5         25.89   InAViT
Action Recognition In Videos  EPIC-KITCHENS-100 (test)    Recall@5         23.75   InAViT
Action Recognition In Videos  EGTEA                       Top-1 Accuracy   67.8    InAViT
Action Recognition In Videos  EPIC-KITCHENS-100           Recall@5         25.89   InAViT
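The leaderboard metric above, mean top-5 recall, averages per-class top-5 recall over classes: for each class, the fraction of its instances whose true label appears among the five highest-scoring predictions. The sketch below is a simplified version under assumed inputs (a score matrix and label vector); the official EK100 evaluation adds further breakdowns not modeled here.

```python
import numpy as np

def mean_topk_recall(scores, labels, k=5):
    """Class-mean top-k recall.

    scores: (N, C) prediction scores; labels: (N,) true class indices.
    Per class: fraction of its instances with the true label in the
    top-k scored classes; result is the mean over classes present in
    `labels`. Simplified sketch, not the official EK100 evaluation code.
    """
    topk = np.argsort(-scores, axis=1)[:, :k]    # (N, k) top-k class indices
    hit = (topk == labels[:, None]).any(axis=1)  # per-instance top-k hit
    classes = np.unique(labels)
    per_class = [hit[labels == c].mean() for c in classes]
    return float(np.mean(per_class))

# Toy example: both instances have their true label in the top 5.
scores = np.array([[0.1, 0.5, 0.2, 0.9, 0.3, 0.4, 0.05, 0.6],
                   [0.9, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]])
labels = np.array([3, 0])
print(mean_topk_recall(scores, labels, k=5))  # 1.0
```

Averaging per class rather than per instance keeps rare classes from being swamped by frequent ones, which matters on long-tailed datasets like EK100.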

Related Papers

- RoHOI: Robustness Benchmark for Human-Object Interaction Detection (2025-07-12)
- Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection (2025-07-09)
- VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions (2025-06-29)
- HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions (2025-06-24)
- On the Robustness of Human-Object Interaction Detection against Distribution Shift (2025-06-22)
- Egocentric Human-Object Interaction Detection: A New Benchmark and Method (2025-06-17)
- InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions (2025-06-11)
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning (2025-06-11)