Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Masked-attention Mask Transformer for Universal Image Segmentation

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

Date: 2021-12-02
Venue: CVPR 2022
Tasks: 2D Semantic Segmentation, Panoptic Segmentation, Segmentation, Semantic Segmentation, Instance Segmentation, Image Segmentation

Abstract

Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
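
The masked attention described in the abstract can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function name `masked_cross_attention`, the 0.5 threshold, the scaling by the square root of the channel count, and the fallback to full attention for a query with an empty mask are all assumptions made for this example. The idea shown is only that each query's cross-attention is restricted to the pixels inside the mask predicted for that query.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, mask_logits, threshold=0.5):
    """Cross-attention restricted to each query's predicted mask region.

    queries:     (N, C)  one embedding per object query
    keys/values: (HW, C) flattened image features
    mask_logits: (N, HW) mask prediction for each query (hypothetical input,
                 e.g. produced by a previous decoder layer)
    """
    # Binarize the mask predictions: True marks foreground pixels.
    foreground = 1.0 / (1.0 + np.exp(-mask_logits)) >= threshold
    # Assumption: a query with an empty mask attends to every pixel,
    # which avoids a softmax over a row that is entirely -inf.
    foreground[~foreground.any(axis=1)] = True

    # Scaled dot-product scores; pixels outside the mask get -inf.
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    scores = np.where(foreground, scores, -np.inf)

    # Numerically stable softmax over the pixel axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights
```

Background pixels receive exactly zero attention weight, so each query aggregates features only from its own predicted region.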

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | COCO (Common Objects in Context) | mIoU | 67.4 | Mask2Former (Swin-L, single-scale) |
| Semantic Segmentation | COCO (Common Objects in Context) | mIoU | 64.8 | MaskFormer (Swin-L, single-scale) |
| Semantic Segmentation | Mapillary val | mIoU | 64.7 | Mask2Former (Swin-L, multiscale) |
| Semantic Segmentation | Fine-Grained Grass Segmentation Dataset | mIoU | 44.93 | Mask2Former |
| Semantic Segmentation | Cityscapes val | mIoU | 84.3 | Mask2Former (Swin-L) |
| Semantic Segmentation | ADE20K val | mIoU | 57.7 | Mask2Former (Swin-L-FaPN, multiscale) |
| Semantic Segmentation | ADE20K val | mIoU | 56.4 | Mask2Former (Swin-L-FaPN) |
| Semantic Segmentation | ADE20K | Validation mIoU | 57.7 | Mask2Former (SwinL-FaPN) |
| Semantic Segmentation | ADE20K | Validation mIoU | 57.3 | Mask2Former (SwinL) |
| Semantic Segmentation | ADE20K | Validation mIoU | 56.4 | Mask2Former (Swin-L-FaPN) |
| Semantic Segmentation | ADE20K | Validation mIoU | 55.1 | Mask2Former (Swin-B) |
| Semantic Segmentation | Cityscapes val | AP | 43.6 | Mask2Former (Swin-L) |
| Semantic Segmentation | Cityscapes val | PQ | 66.6 | Mask2Former (Swin-L) |
| Semantic Segmentation | Cityscapes val | mIoU | 82.9 | Mask2Former (Swin-L) |
| Semantic Segmentation | COCO test-dev | PQ | 58.3 | Mask2Former (Swin-L) |
| Semantic Segmentation | COCO test-dev | PQst | 48.1 | Mask2Former (Swin-L) |
| Semantic Segmentation | COCO test-dev | PQth | 65.1 | Mask2Former (Swin-L) |
| Semantic Segmentation | ADE20K val | AP | 34.2 | Mask2Former (Swin-L) |
| Semantic Segmentation | ADE20K val | PQ | 48.1 | Mask2Former (Swin-L) |
| Semantic Segmentation | ADE20K val | mIoU | 54.5 | Mask2Former (Swin-L) |
| Semantic Segmentation | ADE20K val | AP | 33.2 | Mask2Former (Swin-L + FAPN, 640x640) |
| Semantic Segmentation | ADE20K val | PQ | 46.2 | Mask2Former (Swin-L + FAPN, 640x640) |
| Semantic Segmentation | ADE20K val | mIoU | 55.4 | Mask2Former (Swin-L + FAPN, 640x640) |
| Semantic Segmentation | ADE20K val | PQ | 39.7 | Mask2Former (ResNet-50, 640x640) |
| Semantic Segmentation | ADE20K val | PQ | 37.9 | Panoptic-DeepLab (SwideRNet) |
| Semantic Segmentation | ADE20K val | mIoU | 50 | Panoptic-DeepLab (SwideRNet) |
| Semantic Segmentation | ADE20K val | AP | 26.5 | Mask2Former (ResNet-50, 640x640) |
| Semantic Segmentation | ADE20K val | mIoU | 46.1 | Mask2Former (ResNet-50, 640x640) |
| Semantic Segmentation | COCO minival | AP | 48.6 | Mask2Former (single-scale) |
| Semantic Segmentation | COCO minival | PQ | 57.8 | Mask2Former (single-scale) |
| Semantic Segmentation | COCO minival | PQst | 48.1 | Mask2Former (single-scale) |
| Semantic Segmentation | COCO minival | PQth | 64.2 | Mask2Former (single-scale) |
| Instance Segmentation | COCO minival | mask AP | 50.1 | Mask2Former (Swin-L) |
| Instance Segmentation | Cityscapes val | mask AP | 43.7 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | Cityscapes val | mask AP | 42 | Mask2Former (Swin-B) |
| Instance Segmentation | Cityscapes val | mask AP | 41.8 | Mask2Former (Swin-S) |
| Instance Segmentation | Cityscapes val | mask AP | 39.7 | Mask2Former (Swin-T) |
| Instance Segmentation | Cityscapes val | mask AP | 38.5 | Mask2Former (ResNet-101) |
| Instance Segmentation | Cityscapes val | mask AP | 37.4 | Mask2Former (ResNet-50) |
| Instance Segmentation | COCO val (panoptic labels) | AP | 49.1 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | COCO test-dev | AP50 | 74.9 | Mask2Former (Swin-L, single scale) |
| Instance Segmentation | COCO test-dev | AP75 | 54.9 | Mask2Former (Swin-L, single scale) |
| Instance Segmentation | COCO test-dev | APL | 71.2 | Mask2Former (Swin-L, single scale) |
| Instance Segmentation | COCO test-dev | APM | 53.8 | Mask2Former (Swin-L, single scale) |
| Instance Segmentation | COCO test-dev | APS | 29.1 | Mask2Former (Swin-L, single scale) |
| Instance Segmentation | COCO test-dev | mask AP | 50.5 | Mask2Former (Swin-L, single scale) |
| Instance Segmentation | ADE20K val | AP | 34.9 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | ADE20K val | APL | 54.7 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | ADE20K val | APM | 40 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | ADE20K val | APS | 16.3 | Mask2Former (Swin-L, single-scale) |
| Instance Segmentation | ADE20K val | AP | 33.4 | Mask2Former (Swin-L + FAPN) |
| Instance Segmentation | ADE20K val | APL | 54.6 | Mask2Former (Swin-L + FAPN) |
| Instance Segmentation | ADE20K val | APM | 37.6 | Mask2Former (Swin-L + FAPN) |
| Instance Segmentation | ADE20K val | APS | 14.6 | Mask2Former (Swin-L + FAPN) |
| Instance Segmentation | ADE20K val | AP | 26.4 | Mask2Former (ResNet50) |
| Instance Segmentation | ADE20K val | APS | 10.4 | Mask2Former (ResNet50) |
| Instance Segmentation | ADE20K val | APL | 43.1 | Mask2Former (ResNet-50) |
| Instance Segmentation | ADE20K val | APM | 28.9 | Mask2Former (ResNet-50) |
| 2D Semantic Segmentation | WildScenes | mIoU | 47.85 | Mask2Former (Swin-L) |
| 2D Semantic Segmentation | WildScenes | mIoU | 43.71 | Mask2Former (ResNet-50) |
| Panoptic Segmentation | Cityscapes val | AP | 43.6 | Mask2Former (Swin-L) |
| Panoptic Segmentation | Cityscapes val | PQ | 66.6 | Mask2Former (Swin-L) |
| Panoptic Segmentation | Cityscapes val | mIoU | 82.9 | Mask2Former (Swin-L) |
| Panoptic Segmentation | COCO test-dev | PQ | 58.3 | Mask2Former (Swin-L) |
| Panoptic Segmentation | COCO test-dev | PQst | 48.1 | Mask2Former (Swin-L) |
| Panoptic Segmentation | COCO test-dev | PQth | 65.1 | Mask2Former (Swin-L) |
| Panoptic Segmentation | ADE20K val | AP | 34.2 | Mask2Former (Swin-L) |
| Panoptic Segmentation | ADE20K val | PQ | 48.1 | Mask2Former (Swin-L) |
| Panoptic Segmentation | ADE20K val | mIoU | 54.5 | Mask2Former (Swin-L) |
| Panoptic Segmentation | ADE20K val | AP | 33.2 | Mask2Former (Swin-L + FAPN, 640x640) |
| Panoptic Segmentation | ADE20K val | PQ | 46.2 | Mask2Former (Swin-L + FAPN, 640x640) |
| Panoptic Segmentation | ADE20K val | mIoU | 55.4 | Mask2Former (Swin-L + FAPN, 640x640) |
| Panoptic Segmentation | ADE20K val | PQ | 39.7 | Mask2Former (ResNet-50, 640x640) |
| Panoptic Segmentation | ADE20K val | PQ | 37.9 | Panoptic-DeepLab (SwideRNet) |
| Panoptic Segmentation | ADE20K val | mIoU | 50 | Panoptic-DeepLab (SwideRNet) |
| Panoptic Segmentation | ADE20K val | AP | 26.5 | Mask2Former (ResNet-50, 640x640) |
| Panoptic Segmentation | ADE20K val | mIoU | 46.1 | Mask2Former (ResNet-50, 640x640) |
| Panoptic Segmentation | COCO minival | AP | 48.6 | Mask2Former (single-scale) |
| Panoptic Segmentation | COCO minival | PQ | 57.8 | Mask2Former (single-scale) |
| Panoptic Segmentation | COCO minival | PQst | 48.1 | Mask2Former (single-scale) |
| Panoptic Segmentation | COCO minival | PQth | 64.2 | Mask2Former (single-scale) |

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)