Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar
Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
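
The masked attention described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a single-head, pure-Python toy in which each query attends only to pixels whose predicted mask probability clears a threshold. The function name `masked_attention`, the 0.5 threshold, the scaled dot-product form, and the fallback to full attention when a predicted mask is empty are all assumptions made for illustration.

```python
import math


def masked_attention(queries, keys, values, mask_probs, threshold=0.5):
    """Cross-attention restricted to each query's predicted mask region.

    queries:    N x C list of query vectors
    keys:       P x C list of per-pixel key vectors
    values:     P x C list of per-pixel value vectors
    mask_probs: N x P list of mask probabilities predicted for each query
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q, probs in zip(queries, mask_probs):
        # Pixels inside the predicted mask are the only ones a query may attend to.
        allowed = [p >= threshold for p in probs]
        if not any(allowed):
            # Assumption: an empty mask falls back to attending everywhere,
            # which also keeps the softmax well defined.
            allowed = [True] * len(probs)
        # Disallowed pixels get a logit of -inf, so their softmax weight is 0.
        logits = [
            scale * sum(qi * ki for qi, ki in zip(q, k)) if ok else -math.inf
            for k, ok in zip(keys, allowed)
        ]
        top = max(logits)
        weights = [math.exp(logit - top) for logit in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors, one output channel at a time.
        channels = len(values[0])
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(channels)]
        )
    return outputs


# Toy example: one query, three pixels; the third pixel lies outside the mask.
demo_queries = [[1.0, 0.0]]
demo_keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
demo_values = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
demo_masks = [[0.9, 0.8, 0.1]]
demo_output = masked_attention(demo_queries, demo_keys, demo_values, demo_masks)
```

In the toy example the third pixel would dominate ordinary cross-attention because its key has the largest dot product with the query, yet it contributes nothing once the mask excludes it. That is the localization effect the abstract attributes to masked attention.
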
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | COCO (Common Objects in Context) | mIoU | 67.4 | Mask2Former (Swin-L, single-scale) |
| Semantic Segmentation | COCO (Common Objects in Context) | mIoU | 64.8 | MaskFormer (Swin-L, single-scale) |
| Semantic Segmentation | Mapillary val | mIoU | 64.7 | Mask2Former (Swin-L, multiscale) |
| Semantic Segmentation | Fine-Grained Grass Segmentation Dataset | mIoU | 44.93 | Mask2Former |
| Semantic Segmentation | Cityscapes val | mIoU | 84.3 | Mask2Former (Swin-L) |
| Semantic Segmentation | ADE20K val | mIoU | 57.7 | Mask2Former (Swin-L-FaPN, multiscale) |
| Semantic Segmentation | ADE20K val | mIoU | 56.4 | Mask2Former (Swin-L-FaPN) |
| Semantic Segmentation | ADE20K val | mIoU | 57.3 | Mask2Former (Swin-L) |
| Semantic Segmentation | ADE20K val | mIoU | 55.1 | Mask2Former (Swin-B) |
| Instance Segmentation | COCO minival | mask AP | 50.1 | Mask2Former (Swin-L) |
| Instance Segmentation | Cityscapes val | mask AP | 43.7 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | Cityscapes val | mask AP | 42 | Mask2Former (Swin-B) |
| Instance Segmentation | Cityscapes val | mask AP | 41.8 | Mask2Former (Swin-S) |
| Instance Segmentation | Cityscapes val | mask AP | 39.7 | Mask2Former (Swin-T) |
| Instance Segmentation | Cityscapes val | mask AP | 38.5 | Mask2Former (ResNet-101) |
| Instance Segmentation | Cityscapes val | mask AP | 37.4 | Mask2Former (ResNet-50) |
| Instance Segmentation | COCO val (panoptic labels) | AP | 49.1 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | COCO test-dev | AP50 | 74.9 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | COCO test-dev | AP75 | 54.9 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | COCO test-dev | APL | 71.2 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | COCO test-dev | APM | 53.8 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | COCO test-dev | APS | 29.1 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | COCO test-dev | mask AP | 50.5 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | ADE20K val | AP | 34.9 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | ADE20K val | APL | 54.7 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | ADE20K val | APM | 40 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | ADE20K val | APS | 16.3 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | ADE20K val | AP | 33.4 | Mask2Former (Swin-L + FaPN) |
| Instance Segmentation | ADE20K val | APL | 54.6 | Mask2Former (Swin-L + FaPN) |
| Instance Segmentation | ADE20K val | APM | 37.6 | Mask2Former (Swin-L + FaPN) |
| Instance Segmentation | ADE20K val | APS | 14.6 | Mask2Former (Swin-L + FaPN) |
| Instance Segmentation | ADE20K val | AP | 26.4 | Mask2Former (ResNet-50) |
| Instance Segmentation | ADE20K val | APS | 10.4 | Mask2Former (ResNet-50) |
| Instance Segmentation | ADE20K val | APL | 43.1 | Mask2Former (ResNet-50) |
| Instance Segmentation | ADE20K val | APM | 28.9 | Mask2Former (ResNet-50) |
| 2D Semantic Segmentation | WildScenes | mIoU | 47.85 | Mask2Former (Swin-L) |
| 2D Semantic Segmentation | WildScenes | mIoU | 43.71 | Mask2Former (ResNet-50) |
| Panoptic Segmentation | Cityscapes val | AP | 43.6 | Mask2Former (Swin-L) |
| Panoptic Segmentation | Cityscapes val | PQ | 66.6 | Mask2Former (Swin-L) |
| Panoptic Segmentation | Cityscapes val | mIoU | 82.9 | Mask2Former (Swin-L) |
| Panoptic Segmentation | COCO test-dev | PQ | 58.3 | Mask2Former (Swin-L) |
| Panoptic Segmentation | COCO test-dev | PQst | 48.1 | Mask2Former (Swin-L) |
| Panoptic Segmentation | COCO test-dev | PQth | 65.1 | Mask2Former (Swin-L) |
| Panoptic Segmentation | ADE20K val | AP | 34.2 | Mask2Former (Swin-L) |
| Panoptic Segmentation | ADE20K val | PQ | 48.1 | Mask2Former (Swin-L) |
| Panoptic Segmentation | ADE20K val | mIoU | 54.5 | Mask2Former (Swin-L) |
| Panoptic Segmentation | ADE20K val | AP | 33.2 | Mask2Former (Swin-L + FaPN, 640x640) |
| Panoptic Segmentation | ADE20K val | PQ | 46.2 | Mask2Former (Swin-L + FaPN, 640x640) |
| Panoptic Segmentation | ADE20K val | mIoU | 55.4 | Mask2Former (Swin-L + FaPN, 640x640) |
| Panoptic Segmentation | ADE20K val | PQ | 39.7 | Mask2Former (ResNet-50, 640x640) |
| Panoptic Segmentation | ADE20K val | PQ | 37.9 | Panoptic-DeepLab (SwideRNet) |
| Panoptic Segmentation | ADE20K val | mIoU | 50 | Panoptic-DeepLab (SwideRNet) |
| Panoptic Segmentation | ADE20K val | AP | 26.5 | Mask2Former (ResNet-50, 640x640) |
| Panoptic Segmentation | ADE20K val | mIoU | 46.1 | Mask2Former (ResNet-50, 640x640) |
| Panoptic Segmentation | COCO minival | AP | 48.6 | Mask2Former (single-scale) |
| Panoptic Segmentation | COCO minival | PQ | 57.8 | Mask2Former (single-scale) |
| Panoptic Segmentation | COCO minival | PQst | 48.1 | Mask2Former (single-scale) |
| Panoptic Segmentation | COCO minival | PQth | 64.2 | Mask2Former (single-scale) |