Polezhaev Ignat, Goncharenko Igor, Iurina Natalya
In this paper, we present MDS-ViTNet (Multi Decoder Saliency by Vision Transformer Network), a novel methodology for improving visual saliency (eye-tracking) prediction. This approach holds significant potential for diverse fields, including marketing, medicine, robotics, and retail. We propose a network architecture that leverages the Vision Transformer, moving beyond the conventional ImageNet backbone. The framework adopts an encoder-decoder structure in which the encoder uses a Swin Transformer to efficiently embed the most important features. This process relies on transfer learning: layers from the Vision Transformer are converted by the encoder transformer and integrated into a CNN decoder, which keeps information loss from the original input image to a minimum. The decoder employs a multi-decoding technique, using two decoders to generate two distinct attention maps, which an additional CNN model then combines into a single output. Our trained model, MDS-ViTNet, achieves state-of-the-art results across several benchmarks. To foster further collaboration, we intend to make our code, models, and datasets publicly available.
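
The data flow described above (one shared encoder, two decoders, and a small CNN that fuses the two maps) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class names, channel widths, and layer choices are all hypothetical, and a tiny convolutional stack stands in for the Swin Transformer encoder so that the example stays self-contained.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the Swin encoder: downsamples the image by 4."""

    def __init__(self, in_ch=3, feat_ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, image):
        return self.net(image)


class ConvDecoder(nn.Module):
    """Hypothetical CNN decoder: upsamples features by 4 into a one-channel map."""

    def __init__(self, feat_ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(feat_ch, feat_ch // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(feat_ch // 2, 1, kernel_size=3, padding=1),
        )

    def forward(self, feats):
        return self.net(feats)


class DualDecoderSaliency(nn.Module):
    """Shared encoder, two independent decoders, and a small CNN that merges them."""

    def __init__(self, feat_ch=32):
        super().__init__()
        self.encoder = TinyEncoder(feat_ch=feat_ch)
        self.decoder_a = ConvDecoder(feat_ch)
        self.decoder_b = ConvDecoder(feat_ch)
        # The merge network sees both maps stacked along the channel axis.
        self.merge = nn.Sequential(
            nn.Conv2d(2, 8, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, kernel_size=1),
            nn.Sigmoid(),  # squash the fused map into [0, 1]
        )

    def forward(self, image):
        feats = self.encoder(image)
        map_a = self.decoder_a(feats)
        map_b = self.decoder_b(feats)
        return self.merge(torch.cat([map_a, map_b], dim=1))
```

Calling `DualDecoderSaliency()(torch.rand(2, 3, 64, 64))` yields a tensor with one saliency channel per image at the input resolution. In a setup closer to the one described, the placeholder encoder would be replaced by a pretrained Swin Transformer, and the two decoders would presumably differ in more than their random initialization.
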
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Saliency Detection | SALICON | AUC | 0.8684 | MDS-ViTNet |
| Saliency Detection | SALICON | CC | 0.898 | MDS-ViTNet |
| Saliency Detection | SALICON | KLD | 0.2127 | MDS-ViTNet |
| Saliency Detection | SALICON | SIM | 0.7887 | MDS-ViTNet |
| Saliency Prediction | SALICON | AUC | 0.8684 | MDS-ViTNet |
| Saliency Prediction | SALICON | CC | 0.898 | MDS-ViTNet |
| Saliency Prediction | SALICON | KLD | 0.2127 | MDS-ViTNet |
| Saliency Prediction | SALICON | SIM | 0.7887 | MDS-ViTNet |
| Few-Shot Transfer Learning for Saliency Prediction | SALICON | AUC | 0.8684 | MDS-ViTNet |
| Few-Shot Transfer Learning for Saliency Prediction | SALICON | CC | 0.898 | MDS-ViTNet |
| Few-Shot Transfer Learning for Saliency Prediction | SALICON | KLD | 0.2127 | MDS-ViTNet |
| Few-Shot Transfer Learning for Saliency Prediction | SALICON | SIM | 0.7887 | MDS-ViTNet |
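
For readers unfamiliar with the metrics in the table, the sketch below shows how CC, SIM, and KLD are commonly computed between a predicted and a ground-truth saliency map. It follows the usual definitions from saliency benchmarks rather than the authors' evaluation code; the function names and the `EPS` constant are assumptions, and AUC is omitted because it is computed against binary fixation locations rather than a continuous map.

```python
import numpy as np

EPS = 1e-12  # assumed small constant to avoid division by zero and log(0)


def to_distribution(sal_map):
    """Scale a non-negative saliency map so that its values sum to 1."""
    m = np.asarray(sal_map, dtype=np.float64)
    if (m < 0).any():
        raise ValueError("saliency map must be non-negative")
    total = m.sum()
    if total == 0:
        raise ValueError("saliency map must not be all zeros")
    return m / total


def cc(pred, gt):
    """Pearson linear correlation coefficient between two maps (higher is better)."""
    p = np.asarray(pred, dtype=np.float64).ravel()
    g = np.asarray(gt, dtype=np.float64).ravel()
    p = p - p.mean()
    g = g - g.mean()
    denom = np.sqrt((p ** 2).sum() * (g ** 2).sum())
    return float((p * g).sum() / denom) if denom > 0 else 0.0


def sim(pred, gt):
    """Histogram intersection of the two normalized maps (higher is better)."""
    return float(np.minimum(to_distribution(pred), to_distribution(gt)).sum())


def kld(pred, gt):
    """KL divergence of the prediction from the ground truth (lower is better)."""
    p = to_distribution(pred)
    g = to_distribution(gt)
    return float((g * np.log(EPS + g / (p + EPS))).sum())
```

A perfect prediction gives CC and SIM close to 1 and KLD close to 0, which is why higher values are better in the CC and SIM rows and lower values are better in the KLD rows.
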