Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia
Scene parsing is challenging because of the unrestricted open vocabulary and the diversity of scenes. In this paper, we exploit global context information through different-region-based context aggregation, using our pyramid pooling module within the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective at producing high-quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction tasks. The proposed approach achieves state-of-the-art performance on various datasets. It ranked first in the ImageNet Scene Parsing Challenge 2016, the PASCAL VOC 2012 benchmark and the Cityscapes benchmark. A single PSPNet sets new mIoU records of 85.4% on PASCAL VOC 2012 and 80.2% on Cityscapes.
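The abstract describes the pyramid pooling module only at a high level: features are average-pooled over sub-regions of several sizes, and these region-level summaries are combined with the local features. The following is a minimal, dependency-free sketch of that idea under stated assumptions. The bin sizes (1, 2, 3, 6) follow the configuration reported in the full paper. The paper's per-level 1×1 convolution and bilinear upsampling are replaced here by an identity mapping and nearest-neighbour upsampling to keep the example short. The names `adaptive_avg_pool`, `upsample_nearest` and `pyramid_pooling` are hypothetical helpers, not part of any released PSPNet code.

```python
import math
from typing import List, Sequence

# Hypothetical layout: a feature map is a list of channels, each an H x W grid.
Grid = List[List[float]]
FeatureMap = List[Grid]


def adaptive_avg_pool(channel: Grid, bins: int) -> Grid:
    """Average-pool an H x W grid into a bins x bins grid of sub-region means.

    Bin edges use the common adaptive-pooling rule
    start = floor(i * H / bins), end = ceil((i + 1) * H / bins),
    so every bin is non-empty even when bins > H.
    """
    h, w = len(channel), len(channel[0])
    pooled: Grid = []
    for i in range(bins):
        r0, r1 = (i * h) // bins, math.ceil((i + 1) * h / bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, math.ceil((j + 1) * w / bins)
            cells = [channel[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(grid: Grid, h: int, w: int) -> Grid:
    """Nearest-neighbour upsampling back to h x w (a stand-in for bilinear)."""
    gh, gw = len(grid), len(grid[0])
    return [[grid[r * gh // h][c * gw // w] for c in range(w)] for r in range(h)]


def pyramid_pooling(features: FeatureMap,
                    bin_sizes: Sequence[int] = (1, 2, 3, 6)) -> FeatureMap:
    """Concatenate the input channels with one pooled-and-upsampled copy per level.

    Simplification: the per-level 1x1 convolution of the full model is
    treated as the identity, so no channel reduction happens here.
    """
    h, w = len(features[0]), len(features[0][0])
    out: FeatureMap = [[row[:] for row in ch] for ch in features]  # local features
    for bins in bin_sizes:
        for ch in features:
            out.append(upsample_nearest(adaptive_avg_pool(ch, bins), h, w))
    return out


# Usage: one 4 x 4 channel holding the values 0..15.
fm = [[[float(r * 4 + c) for c in range(4)] for r in range(4)]]
out = pyramid_pooling(fm)
print(len(out))      # 5: the input channel plus one channel per pyramid level
print(out[1][0][0])  # 7.5: the global mean, produced by the 1x1 level
print(out[2][3][3])  # 12.5: mean of the bottom-right quadrant at the 2x2 level
```

Each level summarizes a progressively finer grid of sub-regions, with the 1×1 level acting as a global prior, so after concatenation every pixel sees both its local feature and the context of the regions that contain it. With C input channels this sketch returns 5C channels. Per our reading of the full paper, the 1×1 convolutions there reduce each of the four levels to C/4 channels, so the pooled context adds about C channels rather than 4C.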
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Medical Image Segmentation | Anatomical Tracings of Lesions After Stroke (ATLAS) | Dice | 0.3571 | PSPNet |
| Medical Image Segmentation | Anatomical Tracings of Lesions After Stroke (ATLAS) | IoU | 0.254 | PSPNet |
| Medical Image Segmentation | Anatomical Tracings of Lesions After Stroke (ATLAS) | Precision | 0.4769 | PSPNet |
| Medical Image Segmentation | Anatomical Tracings of Lesions After Stroke (ATLAS) | Recall | 0.3335 | PSPNet |
| Scene Parsing | Cityscapes val | mIoU | 79.7 | PSPNet-101 [20] |
| Scene Parsing | Cityscapes val | mIoU | 78.1 | PSPNet-50 [20] |
| Scene Parsing | CamVid | mIoU | 76 | PSPNet-50 |
| Semantic Segmentation | US3D | mIoU | 73.12 | PSPNet |
| Semantic Segmentation | Fine-Grained Grass Segmentation Dataset | mIoU | 47.95 | PSPNet |
| Semantic Segmentation | Cityscapes val | mIoU | 79.7 | PSPNet (Dilated-ResNet-101) |
| Semantic Segmentation | BDD100K val | mIoU | 62.3 | PSPNet |
| Semantic Segmentation | SELMA | mIoU | 68.4 | PSPNet |
| Semantic Segmentation | Potsdam | mIoU | 82.98 | PSPNet |
| Semantic Segmentation | PASCAL Context | mIoU | 47.8 | PSPNet (ResNet-101) |
| Semantic Segmentation | UrbanLF | mIoU (Real) | 76.34 | PSPNet |
| Semantic Segmentation | UrbanLF | mIoU (Syn) | 75.78 | PSPNet |
| Semantic Segmentation | Vaihingen | mIoU | 76.79 | PSPNet |
| Semantic Segmentation | Trans10K | GFLOPs | 187.03 | PSPNet |
| Semantic Segmentation | DADA-seg | mIoU | 20.1 | PSPNet (ResNet-101) |
| Semantic Segmentation | ADE20K | Test Score | 55.38 | PSPNet |
| Semantic Segmentation | ADE20K | Validation mIoU | 44.94 | PSPNet |
| Semantic Segmentation | ADE20K | Validation mIoU | 43.51 | PSPNet (ResNet-152) |
| Semantic Segmentation | ADE20K | Validation mIoU | 43.29 | PSPNet (ResNet-101) |
| Semantic Segmentation | MFN Dataset | mIoU | 46.1 | PSPNet |
| Semantic Segmentation | CamVid | Frame rate (FPS) | 5.4 | PSPNet |
| Semantic Segmentation | CamVid | Inference time (ms) | 185 | PSPNet |
| Semantic Segmentation | NYU Depth v2 | Speed (ms/frame) | 72 | PSPNet-101 |
| Semantic Segmentation | NYU Depth v2 | mIoU | 43.2 | PSPNet-101 |
| Semantic Segmentation | NYU Depth v2 | Speed (ms/frame) | 47 | PSPNet-50 |
| Semantic Segmentation | NYU Depth v2 | mIoU | 41.8 | PSPNet-50 |
| Semantic Segmentation | NYU Depth v2 | Speed (ms/frame) | 19 | PSPNet-18 |
| Semantic Segmentation | NYU Depth v2 | mIoU | 35.9 | PSPNet-18 |
| Object Detection | DIS-TE4 | E-measure | 0.815 | PSPNet |
| Object Detection | DIS-TE4 | HCE | 3806 | PSPNet |
| Object Detection | DIS-TE4 | MAE | 0.107 | PSPNet |
| Object Detection | DIS-TE4 | S-measure | 0.758 | PSPNet |
| Object Detection | DIS-TE4 | max F-measure | 0.725 | PSPNet |
| Object Detection | DIS-TE4 | weighted F-measure | 0.63 | PSPNet |
| Object Detection | DIS-VD | E-measure | 0.802 | PSPNet |
| Object Detection | DIS-VD | HCE | 1588 | PSPNet |
| Object Detection | DIS-VD | MAE | 0.102 | PSPNet |
| Object Detection | DIS-VD | S-measure | 0.744 | PSPNet |
| Object Detection | DIS-VD | max F-measure | 0.691 | PSPNet |
| Object Detection | DIS-VD | weighted F-measure | 0.603 | PSPNet |
| Object Detection | DIS-TE2 | E-measure | 0.828 | PSPNet |
| Object Detection | DIS-TE2 | HCE | 586 | PSPNet |
| Object Detection | DIS-TE2 | MAE | 0.092 | PSPNet |
| Object Detection | DIS-TE2 | S-measure | 0.763 | PSPNet |
| Object Detection | DIS-TE2 | max F-measure | 0.724 | PSPNet |
| Object Detection | DIS-TE2 | weighted F-measure | 0.636 | PSPNet |
| Object Detection | DIS-TE1 | E-measure | 0.791 | PSPNet |
| Object Detection | DIS-TE1 | HCE | 267 | PSPNet |
| Object Detection | DIS-TE1 | MAE | 0.089 | PSPNet |
| Object Detection | DIS-TE1 | S-measure | 0.725 | PSPNet |
| Object Detection | DIS-TE1 | max F-measure | 0.645 | PSPNet |
| Object Detection | DIS-TE1 | weighted F-measure | 0.557 | PSPNet |
| Object Detection | DIS-TE3 | E-measure | 0.843 | PSPNet |
| Object Detection | DIS-TE3 | HCE | 1111 | PSPNet |
| Object Detection | DIS-TE3 | MAE | 0.092 | PSPNet |
| Object Detection | DIS-TE3 | S-measure | 0.774 | PSPNet |
| Object Detection | DIS-TE3 | max F-measure | 0.747 | PSPNet |
| Object Detection | DIS-TE3 | weighted F-measure | 0.657 | PSPNet |
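Most segmentation rows above report mIoU (mean intersection over union). As a reminder of how that number is formed, the sketch below computes the per-class IoU, TP / (TP + FP + FN), from flattened predicted and ground-truth label maps and then averages over the classes that occur. It is a minimal illustration, not any benchmark's official evaluation script. Treating label 255 as "ignore" is a common convention assumed here. The function names are hypothetical. Benchmarks such as Cityscapes and PASCAL VOC accumulate the counts over the whole evaluation set before averaging, which corresponds to passing the pixels of all images in a single call.

```python
import math
from typing import List, Sequence


def per_class_iou(pred: Sequence[int], target: Sequence[int], num_classes: int,
                  ignore_index: int = 255) -> List[float]:
    """IoU_c = TP_c / (TP_c + FP_c + FN_c) for each class c.

    pred and target are flattened label maps of equal length; predicted labels
    are assumed to lie in [0, num_classes). Classes absent from both get NaN.
    """
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:  # unlabeled pixel: excluded from every count
            continue
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1  # pixel wrongly assigned to class p
            fn[t] += 1  # pixel of class t that was missed
    return [tp[c] / (tp[c] + fp[c] + fn[c]) if tp[c] + fp[c] + fn[c] else math.nan
            for c in range(num_classes)]


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int,
             ignore_index: int = 255) -> float:
    """mIoU: mean of the per-class IoUs, skipping classes that never occur."""
    ious = [v for v in per_class_iou(pred, target, num_classes, ignore_index)
            if not math.isnan(v)]
    return sum(ious) / len(ious)


def dice_from_iou(iou: float) -> float:
    """For a single binary mask, Dice = 2 * IoU / (1 + IoU)."""
    return 2 * iou / (1 + iou)


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
print(per_class_iou(pred, target, 3))  # [0.5, 0.6666666666666666, 1.0]
print(mean_iou(pred, target, 3))       # approximately 0.7222 (= 13/18)
```

The Dice/IoU identity holds for an individual mask. Scores averaged over many cases, such as the ATLAS Dice and IoU rows, therefore need not satisfy it exactly.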