| Rank | Model | mIoU | Extra Training Data | Paper | Date | Code |
|------|-------|------|---------------------|-------|------|------|
| 1 | BEiT-3 | 62.8 | Yes | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 2 | ViT-CoMer | 62.1 | No | - | - | Code |
| 3 | EVA | 61.5 | No | EVA: Exploring the Limits of Masked Visual Repre... | 2022-11-14 | Code |
| 4 | FD-SwinV2-G | 61.4 | Yes | Contrastive Learning Rivals Masked Image Modelin... | 2022-05-27 | Code |
| 5 | MaskDINO-SwinL | 60.8 | Yes | Mask DINO: Towards A Unified Transformer-based F... | 2022-06-06 | Code |
| 6 | OneFormer (InternImage-H, emb_dim=256, multi-scale, 896x896) | 60.8 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 7 | ViT-Adapter-L (Mask2Former, BEiT pretrain) | 60.5 | Yes | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 8 | OneFormer (InternImage-H, emb_dim=256, single-scale, 896x896) | 60.4 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 9 | SERNet-Former_v2 | 59.35 | Yes | SERNet-Former: Semantic Segmentation by Efficien... | 2024-01-28 | Code |
| 10 | X-Decoder (Davit-d5, Deform, single-scale, 1280x1280) | 59.1 | Yes | Generalized Decoding for Pixel, Image, and Langu... | 2022-12-21 | Code |
| 11 | OneFormer (DiNAT-L, single-scale, 1280x1280, COCO-Pretrain) | 58.9 | Yes | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 12 | OneFormer (DiNAT-L, multi-scale, 896x896) | 58.6 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 13 | ViT-Adapter-L (UperNet, BEiT pretrain) | 58.4 | Yes | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 14 | OneFormer (DiNAT-L, multi-scale, 640x640) | 58.4 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 15 | RSSeg-ViT-L (BEiT pretrain) | 58.4 | No | Representation Separation for Semantic Segmentat... | 2022-12-28 | - |
| 16 | EoMT (DINOv2-L, single-scale, 512x512) | 58.4 | No | Your ViT is Secretly an Image Segmentation Model | 2025-03-24 | Code |
| 17 | OneFormer (Swin-L, multi-scale, 896x896) | 58.3 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 18 | OneFormer (DiNAT-L, single-scale, 1280x1280) | 58.3 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 19 | OneFormer (DiNAT-L, single-scale, 640x640) | 58.3 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 20 | SeMask (SeMask Swin-L FaPN-Mask2Former) | 58.2 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 21 | SeMask (SeMask Swin-L MSFaPN-Mask2Former) | 58.2 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 22 | DiNAT-L (Mask2Former) | 58.1 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 23 | X-Decoder (L) | 58.1 | Yes | Generalized Decoding for Pixel, Image, and Langu... | 2022-12-21 | Code |
| 24 | Mask2Former (Swin-L-FaPN, multiscale) | 57.7 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 25 | OneFormer (Swin-L, multi-scale, 640x640) | 57.7 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 26 | SeMask (SeMask Swin-L Mask2Former) | 57.5 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 27 | OneFormer (ConvNeXt-XL, single-scale, 640x640) | 57.4 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 28 | SenFormer (BEiT-L) | 57.1 | No | Efficient Self-Ensemble for Semantic Segmentation | 2021-11-26 | Code |
| 29 | BEiT-L (ViT+UperNet, ImageNet-22k pretrain) | 57.0 | No | BEiT: BERT Pre-Training of Image Transformers | 2021-06-15 | Code |
| 30 | SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale) | 57.0 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 31 | OneFormer (Swin-L, single-scale, 1280x1280) | 57.0 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 32 | OneFormer (Swin-L, single-scale, 640x640) | 57.0 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 33 | FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain) | 56.7 | No | FaPN: Feature-aligned Pyramid Network for Dense ... | 2021-08-16 | Code |
| 34 | OneFormer (ConvNeXt-L, single-scale, 640x640) | 56.6 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 35 | Mask2Former (Swin-L-FaPN) | 56.4 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 36 | DiNAT-L (Mask2Former, 640x640) | 56.3 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 37 | SeMask (SeMask Swin-L MaskFormer) | 56.2 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 38 | CSWin-L (UperNet, ImageNet-22k pretrain) | 55.7 | No | CSWin Transformer: A General Vision Transformer ... | 2021-07-01 | Code |
| 39 | MaskFormer (Swin-L, ImageNet-22k pretrain) | 55.6 | No | Per-Pixel Classification is Not All You Need for... | 2021-07-13 | Code |
| 40 | DeiT-L | 55.6 | No | DeiT III: Revenge of the ViT | 2022-04-14 | Code |
| 41 | Focal-L (UperNet, ImageNet-22k pretrain) | 55.4 | No | Focal Self-attention for Local-Global Interactio... | 2021-07-01 | Code |
| 42 | Mask2Former (Swin-L + FaPN, 640x640) | 55.4 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 43 | SegViT ViT-Large | 55.2 | No | SegViT: Semantic Segmentation with Plain Vision ... | 2022-10-12 | Code |
| 44 | kMaX-DeepLab (ConvNeXt-L, single-scale, 1281x1281) | 55.2 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 45 | kMaX-DeepLab (ConvNeXt-L, single-scale, 641x641) | 54.8 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 46 | Mask2Former (Swin-L) | 54.5 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 47 | K-Net | 54.3 | No | K-Net: Towards Unified Image Segmentation | 2021-06-28 | Code |
| 48 | DEPICT-SA (ViT-L 640x640 multi-scale) | 54.3 | No | Rethinking Decoders for Transformer-based Semant... | 2024-11-05 | Code |
| 49 | SenFormer (Swin-L) | 54.2 | No | Efficient Self-Ensemble for Semantic Segmentation | 2021-11-26 | Code |
| 50 | DeiT-B | 54.1 | No | DeiT III: Revenge of the ViT | 2022-04-14 | Code |
| 51 | MixMIM-L | 53.8 | No | MixMAE: Mixed and Masked Autoencoder for Efficie... | 2022-05-26 | Code |
| 52 | Seg-L-Mask/16 (MS, ViT-L) | 53.63 | No | Segmenter: Transformer for Semantic Segmentation | 2021-05-12 | Code |
| 53 | Swin-L (UperNet, ImageNet-22k pretrain) | 53.5 | No | Swin Transformer: Hierarchical Vision Transforme... | 2021-03-25 | Code |
| 54 | SeMask (SeMask Swin-L FPN) | 53.5 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 55 | PatchConvNet-L120 (UperNet) | 52.9 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 56 | DEPICT-SA (ViT-L 640x640 single-scale) | 52.9 | No | Rethinking Decoders for Transformer-based Semant... | 2024-11-05 | Code |
| 57 | PatchConvNet-B120 (UperNet) | 52.8 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 58 | SegFormer-B5 (MS, 87M #Params, ImageNet-1K pretrain) | 51.8 | No | SegFormer: Simple and Efficient Design for Seman... | 2021-05-31 | Code |
| 59 | Light-Ham (VAN-Huge, 61M, IN-1k, MS) | 51.5 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 60 | PatchConvNet-B60 (UperNet) | 51.1 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 61 | Light-Ham (VAN-Large, 46M, IN-1k, MS) | 51.0 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 62 | UperNet Shuffle-B | 50.5 | No | Shuffle Transformer: Rethinking Spatial Shuffle ... | 2021-06-07 | Code |
| 63 | ELSA-Swin-S | 50.3 | No | ELSA: Enhanced Local Self-Attention for Vision T... | 2021-12-23 | Code |
| 64 | MixMIM-B | 50.3 | No | MixMAE: Mixed and Masked Autoencoder for Efficie... | 2022-05-26 | Code |
| 65 | Twins-SVT-L (UperNet, ImageNet-1k pretrain) | 50.2 | No | Twins: Revisiting the Design of Spatial Attentio... | 2021-04-28 | Code |
| 66 | Seg-B-Mask/16 (MS, ViT-B) | 50.0 | No | Segmenter: Transformer for Semantic Segmentation | 2021-05-12 | Code |
| 67 | Panoptic-DeepLab (SwideRNet) | 50.0 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 68 | Swin-B (UperNet, ImageNet-1k pretrain) | 49.7 | No | Swin Transformer: Hierarchical Vision Transforme... | 2021-03-25 | Code |
| 69 | gSwin-S | 49.69 | No | gSwin: Gated MLP Vision Model with Hierarchical ... | 2022-08-24 | - |
| 70 | Seg-B/8 (MS, ViT-B) | 49.61 | No | Segmenter: Transformer for Semantic Segmentation | 2021-05-12 | Code |
| 71 | UperNet Shuffle-S | 49.6 | No | Shuffle Transformer: Rethinking Spatial Shuffle ... | 2021-06-07 | Code |
| 72 | Light-Ham (VAN-Base, 27M, IN-1k, MS) | 49.6 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 73 | PatchConvNet-S60 (UperNet) | 49.3 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 74 | DPT-Hybrid | 49.02 | No | Vision Transformers for Dense Prediction | 2021-03-24 | Code |
| 75 | DaViT-S (UperNet) | 48.8 | No | DaViT: Dual Attention Vision Transformers | 2022-04-07 | Code |
| 76 | ResNeSt-200 | 48.36 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 77 | HRNetV2 + OCR + RMI (PaddleClas pretrained) | 47.98 | No | Segmentation Transformer: Object-Contextual Repr... | 2019-09-24 | Code |
| 78 | gSwin-T | 47.63 | No | gSwin: Gated MLP Vision Model with Hierarchical ... | 2022-08-24 | - |
| 79 | ResNeSt-269 | 47.6 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 80 | UperNet Shuffle-T | 47.6 | No | Shuffle Transformer: Rethinking Spatial Shuffle ... | 2021-06-07 | Code |
| 81 | DCNAS | 47.12 | No | DCNAS: Densely Connected Neural Architecture Sea... | 2020-03-26 | - |
| 82 | ResNeSt-101 | 46.91 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 83 | Seg-S-Mask/16 (MS, ViT-S) | 46.9 | No | - | - | - |
| 84 | Swin-S (RPE w/ GAB) | 46.41 | No | Understanding Gaussian Attention Bias of Vision ... | 2023-05-08 | Code |
| 85 | DaViT-B (UperNet) | 46.3 | No | DaViT: Dual Attention Vision Transformers | 2022-04-07 | Code |
| 86 | CPN (ResNet-101) | 46.27 | No | Context Prior for Scene Segmentation | 2020-04-03 | Code |
| 87 | MultiMAE (ViT-B) | 46.2 | No | MultiMAE: Multi-modal Multi-task Masked Autoenco... | 2022-04-04 | Code |
| 88 | Mask2Former (ResNet-50, 640x640) | 46.1 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 89 | PyConvSegNet-152 | 45.99 | No | Pyramidal Convolution: Rethinking Convolutional ... | 2020-06-20 | Code |
| 90 | DNL | 45.97 | No | Disentangled Non-Local Neural Networks | 2020-06-11 | Code |
| 91 | CTNet | 45.94 | No | CTNet: Context-based Tandem Network for Semantic... | 2021-04-20 | Code |
| 92 | ACNet (ResNet-101) | 45.9 | No | Adaptive Context Network for Scene Parsing | 2019-11-05 | - |
| 93 | OCR (HRNetV2-W48) | 45.66 | No | Segmentation Transformer: Object-Contextual Repr... | 2019-09-24 | Code |
| 94 | EANet (ResNet-101) | 45.33 | No | Beyond Self-attention: External Attention using ... | 2021-05-05 | Code |
| 95 | kMaX-DeepLab (ResNet50, single-scale, 1281x1281) | 45.3 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 96 | OCR (ResNet-101) | 45.28 | No | Segmentation Transformer: Object-Contextual Repr... | 2019-09-24 | Code |
| 97 | Asymmetric ALNN | 45.24 | No | Asymmetric Non-local Neural Networks for Semanti... | 2019-08-21 | Code |
| 98 | gSwin-VT | 45.07 | No | gSwin: Gated MLP Vision Model with Hierarchical ... | 2022-08-24 | - |
| 99 | LaU-regression-loss | 45.02 | No | Location-aware Upsampling for Semantic Segmentat... | 2019-11-13 | Code |
| 100 | kMaX-DeepLab (ResNet50, single-scale, 641x641) | 45.0 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 101 | EncNet (ResNet-101) | 44.65 | No | Context Encoding for Semantic Segmentation | 2018-03-23 | Code |
| 102 | SGR (ResNet-101) | 44.32 | No | - | - | Code |
| 103 | Auto-DeepLab-L | 43.98 | No | Auto-DeepLab: Hierarchical Neural Architecture S... | 2019-01-10 | Code |
| 104 | PSANet (ResNet-101) | 43.77 | No | - | - | Code |
| 105 | DSSPN (ResNet-101) | 43.68 | No | Dynamic-structured Semantic Propagation Network | 2018-03-16 | - |
| 106 | HRNetV2 (HRNetV2-W48) | 42.99 | No | High-Resolution Representations for Labeling Pix... | 2019-04-09 | Code |
| 107 | UperNet (ResNet-101) | 42.66 | No | Unified Perceptual Parsing for Scene Understanding | 2018-07-26 | Code |
| 108 | RefineNet (ResNet-152) | 40.7 | No | RefineNet: Multi-Path Refinement Networks for Hi... | 2016-11-20 | Code |
| 109 | RefineNet (ResNet-101) | 40.2 | No | RefineNet: Multi-Path Refinement Networks for Hi... | 2016-11-20 | Code |
| 110 | DHR (Swin-L, Mask2Former) | 32.9 | No | DHR: Dual Features-Driven Hierarchical Rebalanci... | 2024-03-30 | Code |
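The score column above is mIoU (mean Intersection-over-Union), the standard semantic-segmentation metric: per-class IoU averaged over classes. As a minimal illustrative sketch of the metric (the `mean_iou` helper below is an assumption for demonstration, not the evaluation code used by any of these papers):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over class labels.

    Per class c: IoU_c = |pred==c AND target==c| / |pred==c OR target==c|.
    mIoU is the average of IoU_c over classes; classes absent from both
    prediction and ground truth are skipped here to avoid 0/0.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    ious = []
    for c in range(num_classes):
        intersection = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:  # skip classes that appear in neither map
            ious.append(intersection / union)
    return float(np.mean(ious))

# Toy example: class 0 has IoU 1/2, class 1 has IoU 2/3.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Benchmark evaluations typically accumulate a confusion matrix over the entire validation set before computing per-class IoU, rather than averaging per-image; the sketch above shows only the core formula.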