Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron
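
The parallel-branch idea in the abstract can be illustrated with a toy sketch. It is not the Detectron implementation: the linear layers stand in for the real convolutional heads, and every name and size here (`NUM_CLASSES`, `MASK_SIZE`, `FEATURE_DIM`, `predict_roi`) is a hypothetical chosen for illustration. The sketch assumes each region of interest (RoI) has already been pooled into a fixed-length feature vector, that the mask branch outputs one mask per class, and that the mask of the top-scoring class is kept at inference.

```python
import math
import random

NUM_CLASSES = 3   # hypothetical number of object classes
MASK_SIZE = 4     # hypothetical mask resolution (MASK_SIZE x MASK_SIZE)
FEATURE_DIM = 8   # hypothetical length of a pooled RoI feature vector


def make_linear(out_dim, in_dim, rng):
    """Random weight matrix standing in for a trained head."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, features):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def box_branch(features, params):
    """Existing Faster R-CNN style branch: class scores and box offsets."""
    scores = apply_linear(params["cls"], features)
    deltas = apply_linear(params["box"], features)
    return scores, deltas


def mask_branch(features, params):
    """Added branch: one MASK_SIZE x MASK_SIZE probability map per class."""
    logits = apply_linear(params["mask"], features)
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    per_class = MASK_SIZE * MASK_SIZE
    masks = []
    for k in range(NUM_CLASSES):
        flat = probs[k * per_class:(k + 1) * per_class]
        masks.append([flat[r * MASK_SIZE:(r + 1) * MASK_SIZE] for r in range(MASK_SIZE)])
    return masks


def predict_roi(features, params):
    """Run both branches on the same RoI features and combine their outputs."""
    scores, deltas = box_branch(features, params)
    masks = mask_branch(features, params)
    label = max(range(NUM_CLASSES), key=lambda k: scores[k])
    # Keep only the mask of the predicted class and binarize it at 0.5.
    binary_mask = [[p >= 0.5 for p in row] for row in masks[label]]
    return label, deltas, binary_mask


rng = random.Random(0)
params = {
    "cls": make_linear(NUM_CLASSES, FEATURE_DIM, rng),
    "box": make_linear(4, FEATURE_DIM, rng),
    "mask": make_linear(NUM_CLASSES * MASK_SIZE * MASK_SIZE, FEATURE_DIM, rng),
}
features = [rng.random() for _ in range(FEATURE_DIM)]
label, deltas, binary_mask = predict_roi(features, params)
```

Both branches read the same RoI features and neither consumes the other's output during the forward pass, which is what "in parallel" means here. The class prediction is only used afterwards to select which per-class mask to keep.
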
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Medical Image Segmentation | Cell17 | Dice | 0.707 | Mask R-CNN |
| Medical Image Segmentation | Cell17 | F1-score | 0.8004 | Mask R-CNN |
| Medical Image Segmentation | Cell17 | Hausdorff | 12.6723 | Mask R-CNN |
| Semantic Segmentation | Cityscapes val | PQth | 54 | Mask R-CNN+COCO |
| Object Localization | GRIT | Localization (ablation) | 44.7 | Mask R-CNN |
| Object Localization | GRIT | Localization (test) | 45.1 | Mask R-CNN |
| Pose Estimation | COCO test-dev | AP | 63.1 | Mask-RCNN |
| Pose Estimation | COCO test-dev | AP50 | 87.3 | Mask-RCNN |
| Pose Estimation | COCO test-dev | AP75 | 68.7 | Mask-RCNN |
| Pose Estimation | COCO test-dev | APL | 71.4 | Mask-RCNN |
| Pose Estimation | COCO | Test AP | 63.1 | Mask R-CNN |
| Pose Estimation | COCO | Validation AP | 69.2 | Mask R-CNN |
| Pose Estimation | COCO test-dev | AP50 | 87.3 | Mask R-CNN |
| Pose Estimation | COCO test-dev | AP75 | 68.7 | Mask R-CNN |
| Pose Estimation | COCO test-dev | APL | 71.4 | Mask R-CNN |
| Pose Estimation | COCO test-dev | APM | 57.8 | Mask R-CNN |
| Pose Estimation | COCO test-challenge | AP | 68.9 | Mask R-CNN* |
| Pose Estimation | COCO test-challenge | AP50 | 89.2 | Mask R-CNN* |
| Pose Estimation | COCO test-challenge | AP75 | 75.2 | Mask R-CNN* |
| Pose Estimation | COCO test-challenge | APL | 82.6 | Mask R-CNN* |
| Pose Estimation | COCO test-challenge | AR | 75.4 | Mask R-CNN* |
| Pose Estimation | COCO test-challenge | AR50 | 93.2 | Mask R-CNN* |
| Pose Estimation | COCO test-challenge | AR75 | 81.2 | Mask R-CNN* |
| Pose Estimation | COCO test-challenge | ARL | 76.8 | Mask R-CNN* |
| Pose Estimation | COCO test-challenge | ARM | 70.2 | Mask R-CNN* |
| Pose Estimation | CrowdPose | AP Easy | 69.4 | Mask R-CNN |
| Pose Estimation | CrowdPose | AP Hard | 45.8 | Mask R-CNN |
| Pose Estimation | CrowdPose | AP Medium | 57.9 | Mask R-CNN |
| Pose Estimation | CrowdPose | mAP @0.5:0.95 | 57.2 | Mask R-CNN |
| Pose Estimation | OCHuman | AP50 | 33.2 | Mask R-CNN |
| Pose Estimation | OCHuman | AP75 | 24.5 | Mask R-CNN |
| Pose Estimation | OCHuman | Validation AP | 20.2 | Mask R-CNN |
| Object Detection | COCO test-dev | AP50 | 62.3 | Mask R-CNN (ResNeXt-101-FPN) |
| Object Detection | COCO test-dev | AP75 | 43.4 | Mask R-CNN (ResNeXt-101-FPN) |
| Object Detection | COCO test-dev | APL | 51.2 | Mask R-CNN (ResNeXt-101-FPN) |
| Object Detection | COCO test-dev | APM | 43.2 | Mask R-CNN (ResNeXt-101-FPN) |
| Object Detection | COCO test-dev | APS | 22.1 | Mask R-CNN (ResNeXt-101-FPN) |
| Object Detection | COCO test-dev | box mAP | 39.8 | Mask R-CNN (ResNeXt-101-FPN) |
| Object Detection | COCO test-dev | AP50 | 60.3 | Mask R-CNN (ResNet-101-FPN) |
| Object Detection | COCO test-dev | AP75 | 41.7 | Mask R-CNN (ResNet-101-FPN) |
| Object Detection | COCO test-dev | APL | 50.2 | Mask R-CNN (ResNet-101-FPN) |
| Object Detection | COCO test-dev | APM | 41.1 | Mask R-CNN (ResNet-101-FPN) |
| Object Detection | COCO test-dev | APS | 20.1 | Mask R-CNN (ResNet-101-FPN) |
| Object Detection | COCO test-dev | box mAP | 38.2 | Mask R-CNN (ResNet-101-FPN) |
| Object Detection | COCO-O | Average mAP | 17.1 | Mask R-CNN (ResNet-50) |
| Object Detection | COCO-O | Effective Robustness | -0.11 | Mask R-CNN (ResNet-50) |
| Object Detection | iSAID | Average Precision | 37.18 | Mask-RCNN+ |
| Object Detection | iSAID | Average Precision | 36.5 | Mask-RCNN |
| Object Detection | COCO minival | box AP | 40 | Mask R-CNN (ResNet-101-FPN) |
| Object Detection | COCO minival | box AP | 37.7 | Mask R-CNN (ResNet-50-FPN) |
| Object Detection | COCO minival | AP50 | 59.5 | Mask R-CNN (ResNeXt-101-FPN) |
| Object Detection | COCO minival | AP75 | 38.9 | Mask R-CNN (ResNeXt-101-FPN) |
| Object Detection | COCO minival | box AP | 36.7 | Mask R-CNN (ResNeXt-101-FPN) |
| Object Detection | COCO | box AP | 45.2 | Mask R-CNN X-152-32x8d |
| Instance Segmentation | iSAID | Average Precision | 37.18 | Mask-RCNN+ |
| Instance Segmentation | iSAID | Average Precision | 36.5 | Mask-RCNN |
| Instance Segmentation | BDD100K val | AP | 20.5 | Mask R-CNN |
| Instance Segmentation | COCO test-dev | AP50 | 60 | Mask R-CNN (ResNeXt-101-FPN) |
| Instance Segmentation | COCO test-dev | AP75 | 39.4 | Mask R-CNN (ResNeXt-101-FPN) |
| Instance Segmentation | COCO test-dev | APL | 53.5 | Mask R-CNN (ResNeXt-101-FPN) |
| Instance Segmentation | COCO test-dev | APM | 39.9 | Mask R-CNN (ResNeXt-101-FPN) |
| Instance Segmentation | COCO test-dev | APS | 16.9 | Mask R-CNN (ResNeXt-101-FPN) |
| Instance Segmentation | COCO test-dev | mask AP | 37.1 | Mask R-CNN (ResNeXt-101-FPN) |
| Human Parsing | MHP v2.0 | AP 0.5 | 14.9 | Mask R-CNN |
| Multi-tissue Nucleus Segmentation | Kumar | Dice | 0.76 | Mask R-CNN (e) |
| Multi-tissue Nucleus Segmentation | Kumar | Hausdorff Distance (mm) | 50.9 | Mask R-CNN (e) |
| Object Segmentation | GRIT | Segmentation (ablation) | 26.2 | Mask R-CNN |
| Object Segmentation | GRIT | Segmentation (test) | 26.2 | Mask R-CNN |
| Multi-Person Pose Estimation | CrowdPose | AP Easy | 69.4 | Mask R-CNN |
| Multi-Person Pose Estimation | CrowdPose | AP Hard | 45.8 | Mask R-CNN |
| Multi-Person Pose Estimation | CrowdPose | AP Medium | 57.9 | Mask R-CNN |
| Multi-Person Pose Estimation | CrowdPose | mAP @0.5:0.95 | 57.2 | Mask R-CNN |
| Multi-Person Pose Estimation | OCHuman | AP50 | 33.2 | Mask R-CNN |
| Multi-Person Pose Estimation | OCHuman | AP75 | 24.5 | Mask R-CNN |
| Multi-Person Pose Estimation | OCHuman | Validation AP | 20.2 | Mask R-CNN |
| Panoptic Segmentation | Cityscapes val | PQth | 54 | Mask R-CNN+COCO |
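
Several metrics in the table are built on mask overlap. The sketch below shows, under simplifying assumptions, how intersection over union (IoU) and the Dice coefficient are computed for two binary masks, and how the IoU thresholds behind AP50 and AP75 decide whether a predicted mask counts as a match. The helper names (`mask_iou`, `dice`, `is_match`) and the toy masks are hypothetical. A full AP computation additionally ranks detections by score and averages precision over recall levels, which is omitted here.

```python
def mask_iou(pred, truth):
    """Intersection over union of two equally sized 0/1 masks."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    inter = sum(p & t for p, t in pairs)
    union = sum(p | t for p, t in pairs)
    # Two empty masks are treated as a perfect match by convention here.
    return inter / union if union else 1.0


def dice(pred, truth):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    inter = sum(p & t for p, t in pairs)
    total = sum(p for p, _ in pairs) + sum(t for _, t in pairs)
    return 2 * inter / total if total else 1.0


def is_match(pred, truth, iou_threshold):
    """A prediction matches a ground-truth mask if IoU reaches the threshold."""
    return mask_iou(pred, truth) >= iou_threshold


# Toy example: the prediction covers 4 of the 6 ground-truth pixels.
pred = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 0]]
truth = [[1, 1, 1],
         [1, 1, 1],
         [0, 0, 0]]

iou_value = mask_iou(pred, truth)          # 4 / 6
dice_value = dice(pred, truth)             # 2 * 4 / (4 + 6)
match_at_50 = is_match(pred, truth, 0.50)  # counts toward AP50
match_at_75 = is_match(pred, truth, 0.75)  # does not count toward AP75
```

In this example the same prediction is a hit at the 0.5 threshold and a miss at 0.75, which is why AP75 values in the table are consistently lower than the AP50 values for the same model.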