Bowen Cheng, Alexander G. Schwing, Alexander Kirillov
Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
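To make the idea of "a set of binary masks, each associated with a single global class label prediction" concrete, the following is a minimal sketch of one plausible way to turn such predictions into a per-pixel semantic map. It is not the authors' implementation. The function name `semantic_inference`, the array shapes, the trailing "no object" class column, and the choice to sum class probability times mask probability over all masks are assumptions made for illustration.

```python
import numpy as np


def semantic_inference(class_logits, mask_logits):
    """Combine N mask predictions into a per-pixel semantic label map.

    class_logits: (N, K + 1) array; the last column is assumed to be a
                  "no object" class (an assumption of this sketch).
    mask_logits:  (N, H, W) array of per-mask logits.
    Returns an (H, W) integer array of class indices in [0, K).
    """
    # Numerically stable softmax over the class dimension.
    shifted = class_logits - class_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    # Drop the assumed "no object" column so it never wins the argmax.
    class_probs = probs[:, :-1]                      # (N, K)

    # Independent sigmoid per mask: each mask is a binary prediction.
    mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))  # (N, H, W)

    # Score of class k at pixel (h, w): sum over masks of
    # P(class k | mask n) * P(pixel in mask n).
    scores = np.einsum("nk,nhw->khw", class_probs, mask_probs)
    return scores.argmax(axis=0)


# Toy usage: two masks on a 2x4 image, two real classes.
class_logits = np.array([[10.0, 0.0, 0.0],   # mask 0 -> class 0
                         [0.0, 10.0, 0.0]])  # mask 1 -> class 1
mask_logits = np.full((2, 2, 4), -10.0)
mask_logits[0, :, :2] = 10.0                 # mask 0 covers the left half
mask_logits[1, :, 2:] = 10.0                 # mask 1 covers the right half
segmentation = semantic_inference(class_logits, mask_logits)
```

In this sketch the same set of (class, mask) pairs could also feed an instance-level or panoptic post-processing step, which is the sense in which a single output format can serve both kinds of task.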
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | Mapillary val | mIoU | 55.4 | MaskFormer (ResNet-50) |
| Semantic Segmentation | ADE20K val | mIoU | 55.6 | MaskFormer (Swin-L, ImageNet-22k pretrain) |
| Semantic Segmentation | ADE20K val | mIoU | 53.8 | MaskFormer (Swin-B) |
| Semantic Segmentation | ADE20K val | mIoU | 48.1 | MaskFormer (ResNet-101) |
| Panoptic Segmentation | COCO test-dev | PQ | 53.3 | MaskFormer (Swin-L) |
| Panoptic Segmentation | COCO test-dev | PQst | 44.5 | MaskFormer (Swin-L) |
| Panoptic Segmentation | COCO test-dev | PQth | 59.1 | MaskFormer (Swin-L) |
| Panoptic Segmentation | ADE20K val | PQ | 35.7 | MaskFormer (R101 + 6 Enc) |
| Panoptic Segmentation | COCO minival | PQ | 52.7 | MaskFormer (single-scale) |
| Panoptic Segmentation | COCO minival | PQst | 44.0 | MaskFormer (single-scale) |
| Panoptic Segmentation | COCO minival | PQth | 58.5 | MaskFormer (single-scale) |
| Panoptic Segmentation | COCO minival | RQ | 63.5 | MaskFormer (single-scale) |
| Panoptic Segmentation | COCO minival | SQ | 81.8 | MaskFormer (single-scale) |