
ViTGaze: Gaze Following with Interaction Features in Vision Transformers

Yuehao Song, Xinggang Wang, Jingfeng Yao, Wenyu Liu, Jinglin Zhang, Xiangmin Xu

2024-03-19 · Gaze Target Estimation
Paper · PDF · Code (official)

Abstract

Gaze following aims to interpret human-scene interactions by predicting the person's focal point of gaze. Prevailing approaches often adopt a two-stage framework, whereby multi-modality information is extracted in the initial stage for gaze target prediction. Consequently, the efficacy of these methods depends heavily on the precision of the preceding modality extraction. Other approaches use a single modality with complex decoders, increasing the network's computational load. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, it builds the gaze following framework mainly on powerful encoders (the decoder accounts for less than 1% of the parameters). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Leveraging this insight, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that ViTs with self-supervised pre-training have an enhanced ability to extract correlation information. Extensive experiments demonstrate the performance of the proposed method. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (3.4% improvement in the area under curve (AUC) score, 5.1% improvement in average precision (AP)) and performance comparable to multi-modality methods while using 59% fewer parameters.
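The abstract's core idea is that a ViT's self-attention maps already encode token-to-token interactions, so attention flowing from "person" tokens to "scene" tokens can be read off as human-scene interaction features and decoded by a very small head. The sketch below illustrates that mechanism only; the class, shapes, the soft head mask standing in for the paper's 2D spatial guidance, and the tiny convolutional heatmap head are all illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of the idea in the abstract: read self-attention maps from
# a plain ViT block and pool the attention rows of person-head tokens as
# human-scene interaction features. All names and shapes are assumptions.
import torch
import torch.nn as nn

class AttentionInteractionSketch(nn.Module):
    def __init__(self, dim=384, num_heads=6, grid=14):
        super().__init__()
        self.num_heads = num_heads
        self.grid = grid
        self.qkv = nn.Linear(dim, dim * 3)  # stand-in for one ViT block's QKV
        # Tiny decoder, in the spirit of "decoder < 1% of parameters".
        self.head = nn.Conv2d(num_heads, 1, kernel_size=3, padding=1)

    def attention_maps(self, tokens):
        # tokens: (B, N, dim) patch tokens from a pre-trained ViT.
        B, N, _ = tokens.shape
        q, k, _ = self.qkv(tokens).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, -1).transpose(1, 2)
        k = k.view(B, N, self.num_heads, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
        # (B, heads, N, N): the "4D" token-to-token interaction volume.
        return attn.softmax(dim=-1)

    def forward(self, tokens, head_mask):
        # head_mask: (B, N) soft indicator of patches covering the person's
        # head, playing the role of 2D spatial guidance (assumed form).
        attn = self.attention_maps(tokens)
        w = head_mask / head_mask.sum(-1, keepdim=True).clamp_min(1e-6)
        # Pool attention rows of head patches: "where does this person look?"
        gaze_feat = torch.einsum("bn,bhnm->bhm", w, attn)  # (B, heads, N)
        fmap = gaze_feat.view(-1, self.num_heads, self.grid, self.grid)
        return torch.sigmoid(self.head(fmap))  # (B, 1, grid, grid) heatmap

# Usage with random stand-in tokens (14x14 patch grid, ViT-S width):
model = AttentionInteractionSketch()
tokens = torch.randn(2, 14 * 14, 384)
mask = torch.zeros(2, 14 * 14)
mask[:, :4] = 1.0  # pretend the first four patches cover the head
print(model(tokens, mask).shape)  # torch.Size([2, 1, 14, 14])
```

The attention volume is "4D" in the sense that it relates every query patch on a 2D grid to every key patch on a 2D grid, which is why the paper can mine human-scene interactions from it without extracting depth or pose in a separate first stage.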

Results

Task                    Dataset               Metric            Value   Model
Gaze Target Estimation  VideoAttentionTarget  AP                0.905   ViTGaze
Gaze Target Estimation  VideoAttentionTarget  AUC               0.938   ViTGaze
Gaze Target Estimation  VideoAttentionTarget  Average Distance  0.102   ViTGaze
Gaze Target Estimation  GazeFollow            AUC               0.949   ViTGaze
Gaze Target Estimation  GazeFollow            Average Distance  0.105   ViTGaze
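For the "Average Distance" rows, gaze-following benchmarks conventionally report the L2 distance, in image coordinates normalized to [0, 1], between the predicted gaze point (the argmax of the predicted heatmap) and the ground-truth gaze point, so lower is better. A minimal sketch under that assumption follows; exact evaluation details (e.g. averaging over GazeFollow's multiple annotators) may differ.

```python
# Average Distance as commonly defined in gaze-following benchmarks:
# normalized L2 distance between the heatmap argmax and the GT point.
import numpy as np

def average_distance(heatmap: np.ndarray, gt_point: np.ndarray) -> float:
    # heatmap: (H, W) predicted gaze heatmap; gt_point: normalized (x, y).
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    pred = np.array([x / w, y / h])
    return float(np.linalg.norm(pred - gt_point))

hm = np.zeros((64, 64))
hm[32, 48] = 1.0  # peak at normalized (0.75, 0.5)
print(average_distance(hm, np.array([0.75, 0.5])))  # 0.0
```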

Related Papers

Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders (2024-12-12)
GazeHTA: End-to-end Gaze Target Detection with Head-Target Association (2024-04-16)
A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions (2024-03-26)
Multimodal Across Domains Gaze Target Detection (2022-08-23)
Where are they looking? (2015-12-01)