SeMask: Semantically Masked Transformers for Semantic Segmentation

Jitesh Jain, Anukriti Singh, Nikita Orlov, Zilong Huang, Jiachen Li, Steven Walton, Humphrey Shi

2021-12-23 | arXiv 2021 12 | Semantic Segmentation

Abstract

Finetuning a pretrained backbone in the encoder part of an image transformer network has been the traditional approach for the semantic segmentation task. However, such an approach leaves out the semantic context that an image provides during the encoding stage. This paper argues that incorporating semantic information of the image into pretrained hierarchical transformer-based backbones while finetuning improves the performance considerably. To achieve this, we propose SeMask, a simple and effective framework that incorporates semantic information into the encoder with the help of a semantic attention operation. In addition, we use a lightweight semantic decoder during training to provide supervision to the intermediate semantic prior maps at every stage. Our experiments demonstrate that incorporating semantic priors enhances the performance of the established hierarchical encoders with a slight increase in the number of FLOPs. We provide empirical proof by integrating SeMask into Swin Transformer and Mix Transformer backbones as our encoder paired with different decoders. Our framework achieves a new state-of-the-art of 58.25% mIoU on the ADE20K dataset and improvements of over 3% in the mIoU metric on the Cityscapes dataset. The code and checkpoints are publicly available at https://github.com/Picsart-AI-Research/SeMask-Segmentation .
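The abstract and the results below report mean intersection-over-union (mIoU). As a reminder of what that number measures, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function name `mean_iou`, the flat label lists, and the `ignore_index=255` convention for unlabeled pixels are assumptions made for illustration. Benchmark toolkits typically accumulate intersections and unions over the whole validation set before averaging over classes; this sketch does the same for whatever pixels it is given.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class ids (flattened pixels).
    ignore_index: hypothetical label for unlabeled pixels, skipped entirely.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled ground-truth pixels do not count
        if p == t:
            # Correct pixel: in both the predicted and the true region of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of the predicted and the true class.
            union[p] += 1
            union[t] += 1
    # Classes absent from both prediction and ground truth are left out of the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * score:.2f}%")  # per-class IoUs are 1/2, 2/3 and 1
```

Multiplying the returned fraction by 100 gives the percentage form used in the table.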

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	Cityscapes val	mIoU	84.98	SeMask (SeMask Swin-L Mask2Former)
Semantic Segmentation	Cityscapes val	mIoU	80.39	SeMask (SeMask Swin-L FPN)
Semantic Segmentation	ADE20K val	mIoU	58.2	SeMask (SeMask Swin-L FaPN-Mask2Former)
Semantic Segmentation	ADE20K val	mIoU	58.2	SeMask (SeMask Swin-L MSFaPN-Mask2Former)
Semantic Segmentation	ADE20K val	mIoU	57.5	SeMask (SeMask Swin-L Mask2Former)
Semantic Segmentation	ADE20K val	mIoU	57	SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale)
Semantic Segmentation	ADE20K val	mIoU	56.2	SeMask (SeMask Swin-L MaskFormer)
Semantic Segmentation	ADE20K val	mIoU	53.5	SeMask (SeMask Swin-L FPN)
Semantic Segmentation	ADE20K	Validation mIoU	58.2	SeMask (SeMask Swin-L FaPN-Mask2Former)
Semantic Segmentation	ADE20K	Validation mIoU	58.2	SeMask (SeMask Swin-L MSFaPN-Mask2Former)
Semantic Segmentation	ADE20K	Validation mIoU	57.5	SeMask (SeMask Swin-L Mask2Former)
Semantic Segmentation	ADE20K	Validation mIoU	57	SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale)
Semantic Segmentation	ADE20K	Validation mIoU	56.2	SeMask (SeMask Swin-L MaskFormer)
Semantic Segmentation	ADE20K	Validation mIoU	53.52	SeMask (SeMask Swin-L FPN)
Semantic Segmentation	ADE20K	Params (M)	96	SeMask (SeMask Swin-B FPN)
Semantic Segmentation	ADE20K	Validation mIoU	50.98	SeMask (SeMask Swin-B FPN)
Semantic Segmentation	ADE20K	Params (M)	56	SeMask (SeMask Swin-S FPN)
Semantic Segmentation	ADE20K	Validation mIoU	47.63	SeMask (SeMask Swin-S FPN)
Semantic Segmentation	ADE20K	Params (M)	35	SeMask (SeMask Swin-T FPN)
Semantic Segmentation	ADE20K	Validation mIoU	43.16	SeMask (SeMask Swin-T FPN)