Metric: AP (higher is better)
| # | Model | AP | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | OneFormer (InternImage-H, emb_dim=1024, single-scale, 896x896, COCO-Pretrained) | 44.2 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 2 | OpenSeeD | 42.6 | Yes | A Simple Framework for Open-Vocabulary Segmentat... | 2023-03-14 | Code |
| 3 | ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280, COCO_pretrain) | 40.7 | Yes | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 4 | OneFormer (DiNAT-L, single-scale, 1280x1280, COCO-pretrain) | 40.2 | Yes | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 5 | X-Decoder (Davit-d5, Deform, single-scale, 1280x1280) | 38.7 | Yes | Generalized Decoding for Pixel, Image, and Langu... | 2022-12-21 | Code |
| 6 | ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280) | 37.8 | No | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 7 | OneFormer (DiNAT-L, single-scale) | 36 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 8 | OneFormer (Swin-L, single-scale) | 35.9 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 9 | X-Decoder (L) | 35.8 | Yes | Generalized Decoding for Pixel, Image, and Langu... | 2022-12-21 | Code |
| 10 | DiNAT-L (Mask2Former, single-scale) | 35.4 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 11 | Mask2Former (Swin-L, single-scale) | 34.9 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 12 | Mask2Former (Swin-L + FAPN) | 33.4 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 13 | Mask2Former (ResNet50) | 26.4 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |