Visual Saliency Transformer

Nian Liu, Ni Zhang, Kaiyuan Wan, Ling Shao, Junwei Han

Published 2021-04-25, ICCV 2021

Tasks: Thermal Image Segmentation, Boundary Detection, Salient Object Detection, RGB-D Salient Object Detection, Object Detection, Saliency Detection

Abstract

Existing state-of-the-art saliency detection methods rely heavily on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which cannot be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike conventional architectures used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to get high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models. Code is available at https://github.com/nnizhang/VST.
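
The abstract's core idea, treating an image as a sequence of patch tokens and letting attention propagate global context among them, can be illustrated with a small sketch. This is not the VST implementation: the function names `image_to_patch_tokens` and `self_attention` are hypothetical, the patches are assumed to be non-overlapping, and the attention uses identity projections instead of learned weights, purely to show how every output token becomes a mixture of all input tokens.

```python
import math

def image_to_patch_tokens(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened,
    non-overlapping patch x patch tokens, in row-major patch order.
    Assumption: non-overlapping patches, chosen for simplicity."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch) for dx in range(patch)])
    return tokens

def self_attention(tokens):
    """Single-head self-attention with identity projections (no learned
    weights): each output token is a softmax-weighted average of ALL
    input tokens, so information flows between distant patches."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score between this query and every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [wgt / total for wgt in weights]
        out.append([sum(wgt * v[i] for wgt, v in zip(weights, tokens))
                    for i in range(dim)])
    return out

# A 4x4 toy image with one bright 2x2 region in the top-right corner.
image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]
tokens = image_to_patch_tokens(image, patch=2)
mixed = self_attention(tokens)
```

In this toy case the all-zero patches attend uniformly to the four tokens, so after one attention step they already carry a trace of the bright patch, regardless of spatial distance. A convolution with a small kernel would need several stacked layers to achieve the same reach.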

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	RGB-T-Glass-Segmentation	MAE	0.044	VST
Object Detection	SIP	Average MAE	0.04	VST
Object Detection	SIP	S-Measure	90.4	VST
Object Detection	SIP	max E-Measure	94.4	VST
Object Detection	SIP	max F-Measure	91.5	VST
Object Detection	NJUD	S-Measure	0.922	VST
Object Detection	NLPR	S-Measure	0.932	VST
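
For readers unfamiliar with the MAE column, the following minimal sketch shows how mean absolute error between a predicted saliency map and a binary ground-truth mask is commonly computed. The function name `mean_absolute_error` and the toy values are illustrative assumptions, not taken from the VST code base. Benchmark figures are typically averages of this per-image score over a whole dataset.

```python
def mean_absolute_error(pred, target):
    """MAE between a predicted saliency map and a ground-truth mask,
    both given as equally sized lists of rows with values in [0, 1]."""
    if len(pred) != len(target) or any(
            len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("prediction and ground truth must have the same shape")
    # Sum the per-pixel absolute differences, then divide by the pixel count.
    total = sum(abs(p - t)
                for prow, trow in zip(pred, target)
                for p, t in zip(prow, trow))
    count = sum(len(row) for row in pred)
    return total / count

# Toy 2x2 example (hypothetical values).
pred = [[0.9, 0.2], [0.1, 0.8]]
target = [[1.0, 0.0], [0.0, 1.0]]
score = mean_absolute_error(pred, target)
```

Here the per-pixel errors are 0.1, 0.2, 0.1, and 0.2, so `score` is approximately 0.15. Lower is better for MAE, whereas the S-Measure, E-Measure, and F-Measure values in the table are higher-is-better.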
