Metric: PQ, Panoptic Quality (higher is better); a brief definition sketch follows the table.
| # | Model | PQ | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | OneFormer (InternImage-H, emb_dim=256, single-scale, 896x896) | 54.5 | No | OneFormer: One Transformer to Rule Universal Image Segmentation | 2022-11-10 | Code |
| 2 | ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280, COCO_pretrain) | 54.0 | Yes | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 3 | OpenSeeD (Swin-L, single-scale, 1280x1280) | 53.7 | Yes | A Simple Framework for Open-Vocabulary Segmentation and Detection | 2023-03-14 | Code |
| 4 | OneFormer (DiNAT-L, single-scale, 1280x1280, COCO-Pretrain) | 53.4 | Yes | OneFormer: One Transformer to Rule Universal Image Segmentation | 2022-11-10 | Code |
| 5 | EoMT (DINOv2-g, single-scale, 1280x1280, COCO pre-trained) | 52.8 | Yes | Your ViT is Secretly an Image Segmentation Model | 2025-03-24 | Code |
| 6 | X-Decoder (DaViT-d5, Deform, single-scale, 1280x1280) | 52.4 | Yes | Generalized Decoding for Pixel, Image, and Language | 2022-12-21 | Code |
| 7 | ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280) | 51.9 | No | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 8 | OneFormer (DiNAT-L, single-scale, 1280x1280) | 51.5 | No | OneFormer: One Transformer to Rule Universal Image Segmentation | 2022-11-10 | Code |
| 9 | OneFormer (Swin-L, single-scale, 1280x1280) | 51.4 | No | OneFormer: One Transformer to Rule Universal Image Segmentation | 2022-11-10 | Code |
| 10 | kMaX-DeepLab (ConvNeXt-L, single-scale, 1281x1281) | 50.9 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 11 | OneFormer (DiNAT-L, single-scale, 640x640) | 50.5 | No | OneFormer: One Transformer to Rule Universal Image Segmentation | 2022-11-10 | Code |
| 12 | OneFormer (ConvNeXt-XL, single-scale, 640x640) | 50.1 | No | OneFormer: One Transformer to Rule Universal Image Segmentation | 2022-11-10 | Code |
| 13 | OneFormer (ConvNeXt-L, single-scale, 640x640) | 50.0 | No | OneFormer: One Transformer to Rule Universal Image Segmentation | 2022-11-10 | Code |
| 14 | OneFormer (Swin-L, single-scale, 640x640) | 49.8 | No | OneFormer: One Transformer to Rule Universal Image Segmentation | 2022-11-10 | Code |
| 15 | X-Decoder (L) | 49.6 | Yes | Generalized Decoding for Pixel, Image, and Language | 2022-12-21 | Code |
| 16 | DiNAT-L (Mask2Former, 640x640) | 49.4 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 17 | kMaX-DeepLab (ConvNeXt-L, single-scale, 641x641) | 48.7 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 18 | Mask2Former (Swin-L) | 48.1 | No | Masked-attention Mask Transformer for Universal Image Segmentation | 2021-12-02 | Code |
| 19 | Mask2Former (Swin-L + FaPN, 640x640) | 46.2 | No | Masked-attention Mask Transformer for Universal Image Segmentation | 2021-12-02 | Code |
| 20 | kMaX-DeepLab (ResNet50, single-scale, 1281x1281) | 42.3 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 21 | kMaX-DeepLab (ResNet50, single-scale, 641x641) | 41.5 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 22 | Mask2Former (ResNet-50, 640x640) | 39.7 | No | Masked-attention Mask Transformer for Universal Image Segmentation | 2021-12-02 | Code |
| 23 | Panoptic-DeepLab (SWideRNet) | 37.9 | No | Masked-attention Mask Transformer for Universal Image Segmentation | 2021-12-02 | Code |
| 24 | MaskFormer (R101 + 6 Enc) | 35.7 | No | Per-Pixel Classification is Not All You Need for Semantic Segmentation | 2021-07-13 | Code |
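
For context on the numbers above: PQ (Panoptic Quality, as defined in Kirillov et al., "Panoptic Segmentation") matches predicted and ground-truth segments at IoU > 0.5, then averages the IoUs of the matched pairs while penalizing unmatched predictions and missed ground-truth segments. The snippet below is a minimal, illustrative sketch of that formula, not the evaluation code behind these entries; the function name and inputs are assumptions for the example.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Minimal PQ sketch (illustrative, not the official evaluator).

    matched_ious: IoUs of predicted/ground-truth segment pairs matched
                  at IoU > 0.5 (the true positives, TP).
    num_fp:       predicted segments left unmatched (false positives).
    num_fn:       ground-truth segments left unmatched (false negatives).
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0
    # PQ = sum(IoU over TP) / (TP + 0.5 * FP + 0.5 * FN),
    # which factors into segmentation quality (SQ) * recognition quality (RQ).
    return sum(matched_ious) / denom

# Example: 3 matched segments, 1 unmatched prediction, 1 missed ground truth.
print(panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1))  # -> 0.6
```

In the standard protocol PQ is computed per class and then averaged over classes; the table reports that class-averaged value scaled to 0-100.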