Masked-attention Mask Transformer for Universal Image Segmentation

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

2021-12-02 | CVPR 2022
Tasks: 2D Semantic Segmentation, Panoptic Segmentation, Segmentation, Semantic Segmentation, Instance Segmentation, Image Segmentation

Abstract

Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
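
The masked attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `masked_cross_attention`, the 0.5 binarization threshold, the `1/sqrt(C)` scaling, and the fallback to full attention for a query whose predicted mask is empty are all assumptions made for illustration. The idea shown is only that each query's cross-attention logits are set to negative infinity outside its predicted mask region before the softmax, so attention weight is confined to that region.

```python
import numpy as np


def masked_cross_attention(queries, keys, values, mask_probs, threshold=0.5):
    """Cross-attention restricted to each query's predicted mask region.

    queries:    (N, C) one feature vector per object query
    keys:       (HW, C) flattened image features
    values:     (HW, C) flattened image features
    mask_probs: (N, HW) per-query mask probabilities from the previous layer
    Returns the attended features (N, C) and the attention weights (N, HW).
    """
    # Binarize the predicted masks (threshold is an assumption of this sketch).
    keep = mask_probs >= threshold

    # Assumed safeguard: a query with an empty mask attends to every location,
    # otherwise its softmax row would have no finite entry.
    empty = ~keep.any(axis=1)
    keep[empty] = True

    # Scaled dot-product logits; locations outside the mask get -inf.
    logits = queries @ keys.T / np.sqrt(queries.shape[1])
    logits = np.where(keep, logits, -np.inf)

    # Numerically stable softmax over spatial locations.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ values, weights
```

In a full decoder layer the result would typically be added back to the query features through a residual connection and followed by self-attention and a feed-forward block; those parts are omitted here.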

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	COCO (Common Objects in Context)	mIoU	67.4	Mask2Former (Swin-L, single-scale)
Semantic Segmentation	COCO (Common Objects in Context)	mIoU	64.8	MaskFormer (Swin-L, single-scale)
Semantic Segmentation	Mapillary val	mIoU	64.7	Mask2Former (Swin-L, multiscale)
Semantic Segmentation	Fine-Grained Grass Segmentation Dataset	mIoU	44.93	Mask2Former
Semantic Segmentation	Cityscapes val	mIoU	84.3	Mask2Former (Swin-L)
Semantic Segmentation	ADE20K val	mIoU	57.7	Mask2Former (Swin-L-FaPN, multiscale)
Semantic Segmentation	ADE20K val	mIoU	56.4	Mask2Former (Swin-L-FaPN)
Semantic Segmentation	ADE20K	Validation mIoU	57.7	Mask2Former (Swin-L-FaPN)
Semantic Segmentation	ADE20K	Validation mIoU	57.3	Mask2Former (Swin-L)
Semantic Segmentation	ADE20K	Validation mIoU	56.4	Mask2Former (Swin-L-FaPN)
Semantic Segmentation	ADE20K	Validation mIoU	55.1	Mask2Former (Swin-B)
Instance Segmentation	COCO minival	mask AP	50.1	Mask2Former (Swin-L)
Instance Segmentation	Cityscapes val	mask AP	43.7	Mask2Former (Swin-L, single-scale)
Instance Segmentation	Cityscapes val	mask AP	42	Mask2Former (Swin-B)
Instance Segmentation	Cityscapes val	mask AP	41.8	Mask2Former (Swin-S)
Instance Segmentation	Cityscapes val	mask AP	39.7	Mask2Former (Swin-T)
Instance Segmentation	Cityscapes val	mask AP	38.5	Mask2Former (ResNet-101)
Instance Segmentation	Cityscapes val	mask AP	37.4	Mask2Former (ResNet-50)
Instance Segmentation	COCO val (panoptic labels)	AP	49.1	Mask2Former (Swin-L, single-scale)
Instance Segmentation	COCO test-dev	AP50	74.9	Mask2Former (Swin-L, single-scale)
Instance Segmentation	COCO test-dev	AP75	54.9	Mask2Former (Swin-L, single-scale)
Instance Segmentation	COCO test-dev	APL	71.2	Mask2Former (Swin-L, single-scale)
Instance Segmentation	COCO test-dev	APM	53.8	Mask2Former (Swin-L, single-scale)
Instance Segmentation	COCO test-dev	APS	29.1	Mask2Former (Swin-L, single-scale)
Instance Segmentation	COCO test-dev	mask AP	50.5	Mask2Former (Swin-L, single-scale)
Instance Segmentation	ADE20K val	AP	34.9	Mask2Former (Swin-L, single-scale)
Instance Segmentation	ADE20K val	APL	54.7	Mask2Former (Swin-L, single-scale)
Instance Segmentation	ADE20K val	APM	40	Mask2Former (Swin-L, single-scale)
Instance Segmentation	ADE20K val	APS	16.3	Mask2Former (Swin-L, single-scale)
Instance Segmentation	ADE20K val	AP	33.4	Mask2Former (Swin-L + FAPN)
Instance Segmentation	ADE20K val	APL	54.6	Mask2Former (Swin-L + FAPN)
Instance Segmentation	ADE20K val	APM	37.6	Mask2Former (Swin-L + FAPN)
Instance Segmentation	ADE20K val	APS	14.6	Mask2Former (Swin-L + FAPN)
Instance Segmentation	ADE20K val	AP	26.4	Mask2Former (ResNet-50)
Instance Segmentation	ADE20K val	APS	10.4	Mask2Former (ResNet-50)
Instance Segmentation	ADE20K val	APL	43.1	Mask2Former (ResNet-50)
Instance Segmentation	ADE20K val	APM	28.9	Mask2Former (ResNet-50)
2D Semantic Segmentation	WildScenes	mIoU	47.85	Mask2Former (Swin-L)
2D Semantic Segmentation	WildScenes	mIoU	43.71	Mask2Former (ResNet-50)
Panoptic Segmentation	Cityscapes val	AP	43.6	Mask2Former (Swin-L)
Panoptic Segmentation	Cityscapes val	PQ	66.6	Mask2Former (Swin-L)
Panoptic Segmentation	Cityscapes val	mIoU	82.9	Mask2Former (Swin-L)
Panoptic Segmentation	COCO test-dev	PQ	58.3	Mask2Former (Swin-L)
Panoptic Segmentation	COCO test-dev	PQst	48.1	Mask2Former (Swin-L)
Panoptic Segmentation	COCO test-dev	PQth	65.1	Mask2Former (Swin-L)
Panoptic Segmentation	ADE20K val	AP	34.2	Mask2Former (Swin-L)
Panoptic Segmentation	ADE20K val	PQ	48.1	Mask2Former (Swin-L)
Panoptic Segmentation	ADE20K val	mIoU	54.5	Mask2Former (Swin-L)
Panoptic Segmentation	ADE20K val	AP	33.2	Mask2Former (Swin-L + FAPN, 640x640)
Panoptic Segmentation	ADE20K val	PQ	46.2	Mask2Former (Swin-L + FAPN, 640x640)
Panoptic Segmentation	ADE20K val	mIoU	55.4	Mask2Former (Swin-L + FAPN, 640x640)
Panoptic Segmentation	ADE20K val	PQ	39.7	Mask2Former (ResNet-50, 640x640)
Panoptic Segmentation	ADE20K val	PQ	37.9	Panoptic-DeepLab (SwideRNet)
Panoptic Segmentation	ADE20K val	mIoU	50	Panoptic-DeepLab (SwideRNet)
Panoptic Segmentation	ADE20K val	AP	26.5	Mask2Former (ResNet-50, 640x640)
Panoptic Segmentation	ADE20K val	mIoU	46.1	Mask2Former (ResNet-50, 640x640)
Panoptic Segmentation	COCO minival	AP	48.6	Mask2Former (single-scale)
Panoptic Segmentation	COCO minival	PQ	57.8	Mask2Former (single-scale)
Panoptic Segmentation	COCO minival	PQst	48.1	Mask2Former (single-scale)
Panoptic Segmentation	COCO minival	PQth	64.2	Mask2Former (single-scale)

