Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, Li Zhang
Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on increasing the receptive field, either through dilated/atrous convolutions or by inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes. In particular, we achieved first place on the highly competitive ADE20K test server leaderboard on the day of submission.
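The core idea in the abstract — encoding an image as a sequence of non-overlapping patches rather than progressively downsampling feature maps — can be illustrated with a minimal sketch. The function below is a hypothetical, NumPy-only version of the ViT-style patchification that SETR's transformer encoder consumes; the function name and patch size are illustrative assumptions, not the paper's code.

```python
import numpy as np

def image_to_patch_sequence(image, patch_size=16):
    """Flatten an H x W x C image into a sequence of non-overlapping
    patch vectors (hypothetical minimal sketch of ViT-style input)."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, \
        "image dims must be divisible by the patch size"
    ph, pw = H // patch_size, W // patch_size
    # Split both spatial axes into (num_patches, patch_size) blocks,
    # then flatten each patch into one vector of length P*P*C.
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(ph * pw, patch_size * patch_size * C)

# A 480x480 RGB image becomes a sequence of 30*30 = 900 tokens,
# each a flattened 16x16x3 = 768-dimensional patch vector.
img = np.zeros((480, 480, 3), dtype=np.float32)
seq = image_to_patch_sequence(img)
print(seq.shape)  # (900, 768)
```

Each such patch vector would then be linearly projected and fed, with position embeddings, to the transformer layers; because self-attention spans all 900 tokens, every layer models global context at full sequence length with no resolution reduction.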
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Medical Image Segmentation | Synapse multi-organ CT | Avg DSC | 79.6 | SETR |
| Semantic Segmentation | Cityscapes val | mIoU | 82.15 | SETR-PUP (80k, MS) |
| Semantic Segmentation | PASCAL Context | mIoU | 55.83 | SETR-MLA (16, 80k, MS) |
| Semantic Segmentation | FoodSeg103 | mIoU | 45.1 | SETR-MLA (ViT-16/B) |
| Semantic Segmentation | FoodSeg103 | mIoU | 41.3 | SETR-Naive (ViT-16/B) |
| Semantic Segmentation | UrbanLF | mIoU (Real) | 77.74 | SETR (ViT-Large) |
| Semantic Segmentation | UrbanLF | mIoU (Syn) | 77.69 | SETR (ViT-Large) |
| Semantic Segmentation | DADA-seg | mIoU | 31.8 | SETR (PUP, Transformer-Large) |
| Semantic Segmentation | DADA-seg | mIoU | 30.4 | SETR (MLA, Transformer-Large) |
| Semantic Segmentation | ADE20K | Validation mIoU | 50.28 | SETR-MLA (160k, MS) |