
ViTGaze: Gaze Following with Interaction Features in Vision Transformers

Yuehao Song, Xinggang Wang, Jingfeng Yao, Wenyu Liu, Jinglin Zhang, Xiangmin Xu

2024-03-19 · Gaze Target Estimation
Paper · PDF · Code (official)

Abstract

Gaze following aims to interpret human-scene interactions by predicting the person's focal point of gaze. Prevailing approaches often adopt a two-stage framework, whereby multi-modality information is extracted in the initial stage for gaze target prediction. Consequently, the efficacy of these methods depends heavily on the precision of the preceding modality extraction. Other approaches use a single modality with complex decoders, increasing the network's computational load. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, it builds the gaze following framework mainly on powerful encoders (the decoder accounts for less than 1% of the parameters). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Leveraging this insight, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that ViTs with self-supervised pre-training have an enhanced ability to extract correlation information. Extensive experiments demonstrate the performance of the proposed method. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (3.4% improvement in the area under curve (AUC) score, 5.1% improvement in average precision (AP)) and performance comparable to multi-modality methods while using 59% fewer parameters.
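The abstract's core idea is that a ViT's self-attention maps already encode token-to-token interactions, so attention flowing from "person" tokens to "scene" tokens can be read off as human-scene interaction features and decoded by a very small head. The sketch below illustrates that mechanism only; the class, shapes, the soft head mask standing in for the paper's 2D spatial guidance, and the tiny convolutional heatmap head are all illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of the idea in the abstract: read self-attention maps from
# a plain ViT block and pool the attention rows of person-head tokens as
# human-scene interaction features. All names and shapes are assumptions.
import torch
import torch.nn as nn

class AttentionInteractionSketch(nn.Module):
    def __init__(self, dim=384, num_heads=6, grid=14):
        super().__init__()
        self.num_heads = num_heads
        self.grid = grid
        self.qkv = nn.Linear(dim, dim * 3)  # stand-in for one ViT block's QKV
        # Tiny decoder, in the spirit of "decoder < 1% of parameters".
        self.head = nn.Conv2d(num_heads, 1, kernel_size=3, padding=1)

    def attention_maps(self, tokens):
        # tokens: (B, N, dim) patch tokens from a pre-trained ViT.
        B, N, _ = tokens.shape
        q, k, _ = self.qkv(tokens).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, -1).transpose(1, 2)
        k = k.view(B, N, self.num_heads, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
        # (B, heads, N, N): the "4D" token-to-token interaction volume.
        return attn.softmax(dim=-1)

    def forward(self, tokens, head_mask):
        # head_mask: (B, N) soft indicator of patches covering the person's
        # head, playing the role of 2D spatial guidance (assumed form).
        attn = self.attention_maps(tokens)
        w = head_mask / head_mask.sum(-1, keepdim=True).clamp_min(1e-6)
        # Pool attention rows of head patches: "where does this person look?"
        gaze_feat = torch.einsum("bn,bhnm->bhm", w, attn)  # (B, heads, N)
        fmap = gaze_feat.view(-1, self.num_heads, self.grid, self.grid)
        return torch.sigmoid(self.head(fmap))  # (B, 1, grid, grid) heatmap

# Usage with random stand-in tokens (14x14 patch grid, ViT-S width):
model = AttentionInteractionSketch()
tokens = torch.randn(2, 14 * 14, 384)
mask = torch.zeros(2, 14 * 14)
mask[:, :4] = 1.0  # pretend the first four patches cover the head
print(model(tokens, mask).shape)  # torch.Size([2, 1, 14, 14])
```

The attention volume is "4D" in the sense that it relates every query patch on a 2D grid to every key patch on a 2D grid, which is why the paper can mine human-scene interactions from it without extracting depth or pose in a separate first stage.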

Results

Task                    Dataset               Metric            Value   Model
Gaze Target Estimation  VideoAttentionTarget  AP                0.905   ViTGaze
Gaze Target Estimation  VideoAttentionTarget  AUC               0.938   ViTGaze
Gaze Target Estimation  VideoAttentionTarget  Average Distance  0.102   ViTGaze
Gaze Target Estimation  GazeFollow            AUC               0.949   ViTGaze
Gaze Target Estimation  GazeFollow            Average Distance  0.105   ViTGaze
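For the "Average Distance" rows, gaze-following benchmarks conventionally report the L2 distance, in image coordinates normalized to [0, 1], between the predicted gaze point (the argmax of the predicted heatmap) and the ground-truth gaze point, so lower is better. A minimal sketch under that assumption follows; exact evaluation details (e.g. averaging over GazeFollow's multiple annotators) may differ.

```python
# Average Distance as commonly defined in gaze-following benchmarks:
# normalized L2 distance between the heatmap argmax and the GT point.
import numpy as np

def average_distance(heatmap: np.ndarray, gt_point: np.ndarray) -> float:
    # heatmap: (H, W) predicted gaze heatmap; gt_point: normalized (x, y).
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    pred = np.array([x / w, y / h])
    return float(np.linalg.norm(pred - gt_point))

hm = np.zeros((64, 64))
hm[32, 48] = 1.0  # peak at normalized (0.75, 0.5)
print(average_distance(hm, np.array([0.75, 0.5])))  # 0.0
```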

Related Papers

Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders (2024-12-12)
GazeHTA: End-to-end Gaze Target Detection with Head-Target Association (2024-04-16)
A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions (2024-03-26)
Multimodal Across Domains Gaze Target Detection (2022-08-23)
Where are they looking? (2015-12-01)