Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H. S. Torr, Li Zhang
Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on increasing the receptive field, either through dilated/atrous convolutions or by inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes. In particular, we achieved first place on the highly competitive ADE20K test server leaderboard on the day of submission.
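The core idea in the abstract — encoding an image as a sequence of non-overlapping patches rather than progressively downsampling feature maps — can be illustrated with a minimal sketch. The function below is a hypothetical, NumPy-only version of the ViT-style patchification that SETR's transformer encoder consumes; the function name and patch size are illustrative assumptions, not the paper's code.

```python
import numpy as np

def image_to_patch_sequence(image, patch_size=16):
    """Flatten an H x W x C image into a sequence of non-overlapping
    patch vectors (hypothetical minimal sketch of ViT-style input)."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, \
        "image dims must be divisible by the patch size"
    ph, pw = H // patch_size, W // patch_size
    # Split both spatial axes into (num_patches, patch_size) blocks,
    # then flatten each patch into one vector of length P*P*C.
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(ph * pw, patch_size * patch_size * C)

# A 480x480 RGB image becomes a sequence of 30*30 = 900 tokens,
# each a flattened 16x16x3 = 768-dimensional patch vector.
img = np.zeros((480, 480, 3), dtype=np.float32)
seq = image_to_patch_sequence(img)
print(seq.shape)  # (900, 768)
```

Each such patch vector would then be linearly projected and fed, with position embeddings, to the transformer layers; because self-attention spans all 900 tokens, every layer models global context at full sequence length with no resolution reduction.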
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Medical Image Segmentation | Synapse multi-organ CT | Avg DSC | 79.6 | SETR |
| Semantic Segmentation | Cityscapes val | mIoU | 82.15 | SETR-PUP (80k, MS) |
| Semantic Segmentation | PASCAL Context | mIoU | 55.83 | SETR-MLA (16, 80k, MS) |
| Semantic Segmentation | FoodSeg103 | mIoU | 45.1 | SETR-MLA (ViT-16/B) |
| Semantic Segmentation | FoodSeg103 | mIoU | 41.3 | SETR-Naive (ViT-16/B) |
| Semantic Segmentation | UrbanLF | mIoU (Real) | 77.74 | SETR (ViT-Large) |
| Semantic Segmentation | UrbanLF | mIoU (Syn) | 77.69 | SETR (ViT-Large) |
| Semantic Segmentation | DADA-seg | mIoU | 31.8 | SETR (PUP, Transformer-Large) |
| Semantic Segmentation | DADA-seg | mIoU | 30.4 | SETR (MLA, Transformer-Large) |
| Semantic Segmentation | ADE20K | Validation mIoU | 50.28 | SETR-MLA (160k, MS) |