Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification

Jiahang Li, Shibo Xue, Yong Su

2025-04-08Image Classification cross-modal alignment Multimodal Deep Learning Time Series Alignment Time Series

Abstract

Inspired by human visual attention, deep neural networks have widely adopted attention mechanisms to learn locally discriminative attributes for challenging visual classification tasks. However, existing approaches primarily emphasize the representation of such features while neglecting their precise localization, which often leads to misclassification caused by shortcut biases. This limitation becomes even more pronounced when models are evaluated on transfer or out-of-distribution datasets. In contrast, humans are capable of leveraging prior object knowledge to quickly localize and compare fine-grained attributes, a capability that is especially crucial in complex and high-variance classification scenarios. Motivated by this, we introduce Gaze-CIFAR-10, a human gaze time-series dataset, along with a dual-sequence gaze encoder that models the precise sequential localization of human attention on distinct local attributes. In parallel, a Vision Transformer (ViT) is employed to learn the sequential representation of image content. Through cross-modal fusion, our framework integrates human gaze priors with machine-derived visual sequences, effectively correcting inaccurate localization in image feature representations. Extensive qualitative and quantitative experiments demonstrate that gaze-guided cognitive cues significantly enhance classification accuracy.

Results

Task	Dataset	Metric	Value	Model
Image Classification	Gaze-CIFAR-10	1:1 Accuracy	85.9	DSGE-ConvNeXtV2
Image Classification	Gaze-CIFAR-10	1:1 Accuracy	84.2	DSGE-ViT

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations2025-07-18 Adversarial attacks to image classification systems using evolutionary algorithms2025-07-17 Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy2025-07-17 Federated Learning for Commercial Image Sources2025-07-17 MUPAX: Multidimensional Problem Agnostic eXplainable AI2025-07-17 Transformer-based Spatial Grounding: A Comprehensive Survey2025-07-17 MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling2025-07-17 The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting2025-07-17