| 1 | ViT-P (InternImage-H) | 63.6 | Yes | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 2 | ONE-PEACE | 63 | Yes | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 3 | InternImage-H | 62.9 | Yes | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 4 | M3I Pre-training (InternImage-H) | 62.9 | Yes | Towards All-in-one Pre-training via Maximizing M... | 2022-11-17 | Code |
| 5 | BEiT-3 | 62.8 | Yes | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 6 | EVA | 62.3 | Yes | EVA: Exploring the Limits of Masked Visual Repre... | 2022-11-14 | Code |
| 7 | ViT-P (OneFormer, InternImage-H) | 61.6 | No | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 8 | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) | 61.5 | Yes | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 9 | FD-SwinV2-G | 61.4 | No | Contrastive Learning Rivals Masked Image Modelin... | 2022-05-27 | Code |
| 10 | RevCol-H (Mask2Former) | 61 | Yes | Reversible Column Networks | 2022-12-22 | Code |
| 11 | Mask DINO (SwinL, multi-scale) | 60.8 | Yes | Mask DINO: Towards A Unified Transformer-based F... | 2022-06-06 | Code |
| 12 | ViT-Adapter-L (Mask2Former, BEiT pretrain) | 60.5 | Yes | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 13 | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) | 60.2 | No | DINOv2: Learning Robust Visual Features without ... | 2023-04-14 | Code |
| 14 | ViT-P (OneFormer, DiNAT-L) | 59.9 | No | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 15 | SwinV2-G(UperNet) | 59.9 | Yes | Swin Transformer V2: Scaling Up Capacity and Res... | 2021-11-18 | Code |
| 16 | PIIP-LH6B(UperNet) | 59.9 | No | Parameter-Inverted Image Pyramid Networks | 2024-06-06 | Code |
| 17 | SERNet-Former | 59.35 | No | SERNet-Former: Semantic Segmentation by Efficien... | 2024-01-28 | Code |
| 18 | FocalNet-L (Mask2Former) | 58.5 | Yes | Focal Modulation Networks | 2022-03-22 | Code |
| 19 | ViT-Adapter-L (UperNet, BEiT pretrain) | 58.4 | No | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 20 | RSSeg-ViT-L (BEiT pretrain) | 58.4 | No | Representation Separation for Semantic Segmentat... | 2022-12-28 | - |
| 21 | EoMT (DINOv2-L, single-scale, 512x512) | 58.4 | No | Your ViT is Secretly an Image Segmentation Model | 2025-03-24 | Code |
| 22 | SegViT-v2 (BEiT-v2-Large) | 58.2 | No | SegViTv2: Exploring Efficient and Continual Sema... | 2023-06-09 | Code |
| 23 | SeMask (SeMask Swin-L FaPN-Mask2Former) | 58.2 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 24 | SeMask (SeMask Swin-L MSFaPN-Mask2Former) | 58.2 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 25 | DiNAT-L (Mask2Former) | 58.1 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 26 | HorNet-L (Mask2Former) | 57.9 | No | HorNet: Efficient High-Order Spatial Interaction... | 2022-07-28 | Code |
| 27 | Mask2Former (SwinL-FaPN) | 57.7 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 28 | FASeg (SwinL) | 57.7 | No | Dynamic Focus-aware Positional Queries for Seman... | 2022-04-04 | Code |
| 29 | RR (BEiT-L) | 57.7 | No | Region Rebalance for Long-Tailed Semantic Segmen... | 2022-04-05 | Code |
| 30 | MOAT-4 (IN-22K pretraining, single-scale) | 57.6 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 31 | Frozen Backbone, SwinV2-G-ext22K (Mask2Former) | 57.6 | No | Could Giant Pretrained Image Models Extract Univ... | 2022-11-03 | - |
| 32 | SeMask (SeMask Swin-L Mask2Former) | 57.5 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 33 | Mask2Former (SwinL) | 57.3 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 34 | SenFormer (BEiT-L) | 57.1 | Yes | Efficient Self-Ensemble for Semantic Segmentation | 2021-11-26 | Code |
| 35 | BEiT-L (ViT+UperNet) | 57 | No | BEiT: BERT Pre-Training of Image Transformers | 2021-06-15 | Code |
| 36 | SeMask(SeMask Swin-L MSFaPN-Mask2Former, single-scale) | 57 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 37 | MetaPrompt-SD | 56.8 | No | Harnessing Diffusion Models for Visual Perceptio... | 2023-12-22 | Code |
| 38 | FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain) | 56.7 | No | FaPN: Feature-aligned Pyramid Network for Dense ... | 2021-08-16 | Code |
| 39 | MOAT-3 (IN-22K pretraining, single-scale) | 56.5 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 40 | Mask2Former (Swin-L-FaPN) | 56.4 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 41 | SeMask (SeMask Swin-L MaskFormer) | 56.2 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 42 | dBOT ViT-L (CLIP) | 56.2 | No | Exploring Target Representations for Masked Auto... | 2022-09-08 | Code |
| 43 | Mask2Former+CBL(Swin-B) | 56.1 | No | - | - | Code |
| 44 | TADP | 55.9 | No | Text-image Alignment for Diffusion-based Percept... | 2023-09-29 | Code |
| 45 | CSWin-L (UperNet, ImageNet-22k pretrain) | 55.7 | No | CSWin Transformer: A General Vision Transformer ... | 2021-07-01 | Code |
| 46 | UniRepLKNet-XL | 55.6 | No | UniRepLKNet: A Universal Perception Large-Kernel... | 2023-11-27 | Code |
| 47 | Focal-L (UperNet, ImageNet-22k pretrain) | 55.4 | No | Focal Self-attention for Local-Global Interactio... | 2021-07-01 | Code |
| 48 | InternImage-XL | 55.3 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 49 | dBOT ViT-L | 55.2 | No | Exploring Target Representations for Masked Auto... | 2022-09-08 | Code |
| 50 | Mask2Former(Swin-B) | 55.1 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 51 | ConvNeXt V2-H (FCMAE) | 55 | No | ConvNeXt V2: Co-designing and Scaling ConvNets w... | 2023-01-02 | Code |
| 52 | UniRepLKNet-L++ | 55 | No | UniRepLKNet: A Universal Perception Large-Kernel... | 2023-11-27 | Code |
| 53 | DiNAT-Large (UperNet) | 54.9 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 54 | MaskFormer+CBL(Swin-B) | 54.9 | No | - | - | Code |
| 55 | TransNeXt-Base (IN-1K pretrain, Mask2Former, 512) | 54.7 | No | TransNeXt: Robust Foveal Visual Perception for V... | 2023-11-28 | Code |
| 56 | MOAT-2 (IN-22K pretraining, single-scale) | 54.7 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 57 | CAE (ViT-L, UperNet) | 54.7 | No | Context Autoencoder for Self-Supervised Represen... | 2022-02-07 | Code |
| 58 | VAN-B6 | 54.7 | No | Visual Attention Network | 2022-02-20 | Code |
| 59 | DiNAT_s-Large (UperNet) | 54.6 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 60 | DDP (Swin-L, step-3) | 54.4 | No | DDP: Diffusion Model for Dense Visual Prediction | 2023-03-30 | Code |
| 61 | PatchDiverse + Swin-L (multi-scale test, upernet, ImageNet22k pretrain) | 54.4 | No | Vision Transformers with Patch Diversification | 2021-04-26 | Code |
| 62 | VOLO-D5 | 54.3 | No | VOLO: Vision Outlooker for Visual Recognition | 2021-06-24 | Code |
| 63 | K-Net | 54.3 | No | K-Net: Towards Unified Image Segmentation | 2021-06-28 | Code |
| 64 | GPaCo (Swin-L) | 54.3 | No | Generalized Parametric Contrastive Learning | 2022-09-26 | Code |
| 65 | SenFormer (Swin-L) | 54.2 | Yes | Efficient Self-Ensemble for Semantic Segmentation | 2021-11-26 | Code |
| 66 | Swin V2-H | 54.2 | No | ConvNeXt V2: Co-designing and Scaling ConvNets w... | 2023-01-02 | Code |
| 67 | InternImage-L | 54.1 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 68 | TransNeXt-Small (IN-1K pretrain, Mask2Former, 512) | 54.1 | No | TransNeXt: Robust Foveal Visual Perception for V... | 2023-11-28 | Code |
| 69 | ConvNeXt-XL++ | 54 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 70 | Sequential Ensemble (SegFormer) | 54 | No | Sequential Ensembling for Semantic Segmentation | 2022-10-08 | - |
| 71 | MogaNet-XL (UperNet) | 54 | No | MogaNet: Multi-order Gated Aggregation Network | 2022-11-07 | Code |
| 72 | UniRepLKNet-B++ | 53.9 | No | UniRepLKNet: A Universal Perception Large-Kernel... | 2023-11-27 | Code |
| 73 | MaskFormer(Swin-B) | 53.8 | No | Per-Pixel Classification is Not All You Need for... | 2021-07-13 | Code |
| 74 | ConvNeXt-L++ | 53.7 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 75 | SwinV2-G-HTC++ | 53.7 | No | Swin Transformer V2: Scaling Up Capacity and Res... | 2021-11-18 | Code |
| 76 | ConvNeXt V2-L | 53.7 | No | ConvNeXt V2: Co-designing and Scaling ConvNets w... | 2023-01-02 | Code |
| 77 | Seg-L-Mask/16 (MS) | 53.63 | No | Segmenter: Transformer for Semantic Segmentation | 2021-05-12 | Code |
| 78 | MAE (ViT-L, UperNet) | 53.6 | No | Masked Autoencoders Are Scalable Vision Learners | 2021-11-11 | Code |
| 79 | SeMask (SeMask Swin-L FPN) | 53.52 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 80 | Swin-L (UperNet, ImageNet-22k pretrain) | 53.5 | No | Swin Transformer: Hierarchical Vision Transforme... | 2021-03-25 | Code |
| 81 | Swin-L | 53.5 | No | ConvNeXt V2: Co-designing and Scaling ConvNets w... | 2023-01-02 | Code |
| 82 | TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512) | 53.4 | No | TransNeXt: Robust Foveal Visual Perception for V... | 2023-11-28 | Code |
| 83 | ConvNeXt-B++ | 53.1 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 84 | PatchConvNet-L120 (UperNet) | 52.9 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 85 | dBOT ViT-B (CLIP) | 52.9 | No | Exploring Target Representations for Masked Auto... | 2022-09-08 | Code |
| 86 | PatchConvNet-B120 (UperNet) | 52.8 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 87 | Swin-B | 52.8 | No | ConvNeXt V2: Co-designing and Scaling ConvNets w... | 2023-01-02 | Code |
| 88 | UniRepLKNet-S++ | 52.7 | No | UniRepLKNet: A Universal Perception Large-Kernel... | 2023-11-27 | Code |
| 89 | ConvNeXt V2-B | 52.1 | No | ConvNeXt V2: Co-designing and Scaling ConvNets w... | 2023-01-02 | Code |
| 90 | DeBiFormer-B (IN1k pretrain, Upernet 160k) | 52 | No | DeBiFormer: Vision Transformer with Deformable A... | 2024-10-11 | Code |
| 91 | LV-ViT-L (UperNet, MS) | 51.8 | No | All Tokens Matter: Token Labeling for Training B... | 2021-04-22 | Code |
| 92 | SegFormer-B5 | 51.8 | Yes | SegFormer: Simple and Efficient Design for Seman... | 2021-05-31 | Code |
| 93 | BiFormer-B (IN1k pretrain, Upernet 160k) | 51.7 | No | BiFormer: Vision Transformer with Bi-Level Routi... | 2023-03-15 | Code |
| 94 | ConvNeXt V2-L (Supervised) | 51.6 | No | ConvNeXt V2: Co-designing and Scaling ConvNets w... | 2023-01-02 | Code |
| 95 | Light-Ham (VAN-Huge) | 51.5 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 96 | DAT-B++ | 51.5 | No | DAT++: Spatially Dynamic Vision Transformer with... | 2023-09-04 | Code |
| 97 | CrossFormer (ImageNet1k-pretrain, UPerNet, multi-scale test) | 51.4 | No | CrossFormer: A Versatile Vision Transformer Hing... | 2021-07-31 | Code |
| 98 | InternImage-B | 51.3 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 99 | DAT-S++ | 51.2 | No | DAT++: Spatially Dynamic Vision Transformer with... | 2023-09-04 | Code |
| 100 | ActiveMLP-L(UperNet) | 51.1 | No | Active Token Mixer | 2022-03-11 | Code |
| 101 | SegFormer-B4 | 51.1 | Yes | SegFormer: Simple and Efficient Design for Seman... | 2021-05-31 | Code |
| 102 | PatchConvNet-B60 (UperNet) | 51.1 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 103 | Light-Ham (VAN-Large) | 51 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 104 | TEC (Vit-B, Upernet) | 51 | No | Towards Sustainable Self-supervised Learning | 2022-10-20 | Code |
| 105 | UniRepLKNet-S | 51 | No | UniRepLKNet: A Universal Perception Large-Kernel... | 2023-11-27 | Code |
| 106 | SeMask (SeMask Swin-B FPN) | 50.98 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 107 | InternImage-S | 50.9 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 108 | MogaNet-L (UperNet) | 50.9 | No | MogaNet: Multi-order Gated Aggregation Network | 2022-11-07 | Code |
| 109 | dBOT ViT-B | 50.8 | No | Exploring Target Representations for Masked Auto... | 2022-09-08 | Code |
| 110 | Upernet-BiFormer-S (IN1k pretrain, Upernet 160k) | 50.8 | No | BiFormer: Vision Transformer with Bi-Level Routi... | 2023-03-15 | Code |
| 111 | UperNet Shuffle-B | 50.5 | No | Shuffle Transformer: Rethinking Spatial Shuffle ... | 2021-06-07 | Code |
| 112 | ConvNeXt V1-L | 50.5 | No | ConvNeXt V2: Co-designing and Scaling ConvNets w... | 2023-01-02 | Code |
| 113 | DiNAT-Base (UperNet) | 50.4 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 114 | ELSA-Swin-S | 50.3 | No | ELSA: Enhanced Local Self-Attention for Vision T... | 2021-12-23 | Code |
| 115 | DAT-T++ | 50.3 | No | DAT++: Spatially Dynamic Vision Transformer with... | 2023-09-04 | Code |
| 116 | SETR-MLA (160k, MS) | 50.28 | No | Rethinking Semantic Segmentation from a Sequence... | 2020-12-31 | Code |
| 117 | VAN-Large (HamNet) | 50.2 | No | Visual Attention Network | 2022-02-20 | Code |
| 118 | HRViT-b3 (SegFormer, SS) | 50.2 | No | Multi-Scale High-Resolution Vision Transformer f... | 2021-11-01 | Code |
| 119 | Twins-SVT-L (UperNet, ImageNet-1k pretrain) | 50.2 | No | Twins: Revisiting the Design of Spatial Attentio... | 2021-04-28 | Code |
| 120 | MogaNet-B (UperNet) | 50.1 | No | MogaNet: Multi-order Gated Aggregation Network | 2022-11-07 | Code |
| 121 | Seg-B-Mask/16(MS, ViT-B) | 50 | No | Segmenter: Transformer for Semantic Segmentation | 2021-05-12 | Code |
| 122 | iBOT (ViT-B/16) | 50 | No | iBOT: Image BERT Pre-Training with Online Tokeni... | 2021-11-15 | Code |
| 123 | ConvNeXt-B | 49.9 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 124 | DiNAT-Small (UperNet) | 49.9 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 125 | ConvNeXt V1-B | 49.9 | No | ConvNeXt V2: Co-designing and Scaling ConvNets w... | 2023-01-02 | Code |
| 126 | NAT-Base | 49.7 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 127 | Swin-B (UperNet, ImageNet-1k pretrain) | 49.7 | No | Swin Transformer: Hierarchical Vision Transforme... | 2021-03-25 | Code |
| 128 | Seg-B/8 (MS, ViT-B) | 49.61 | No | Segmenter: Transformer for Semantic Segmentation | 2021-05-12 | Code |
| 129 | ConvNeXt-S | 49.6 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 130 | Light-Ham (VAN-Base) | 49.6 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 131 | NAT-Small | 49.5 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 132 | DaViT-B | 49.4 | No | DaViT: Dual Attention Vision Transformers | 2022-04-07 | Code |
| 133 | DAT-B (UperNet) | 49.38 | No | Vision Transformer with Deformable Attention | 2022-01-03 | Code |
| 134 | PatchConvNet-S60 (UperNet) | 49.3 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 135 | ColorMAE-Green-ViTB-1600 | 49.3 | No | ColorMAE: Exploring data-independent masking str... | 2024-07-17 | Code |
| 136 | MogaNet-S (UperNet) | 49.2 | No | MogaNet: Multi-order Gated Aggregation Network | 2022-11-07 | Code |
| 137 | Shift-B (UperNet) | 49.2 | No | When Shift Operation Meets Vision Transformer: A... | 2022-01-26 | Code |
| 138 | UniRepLKNet-T | 49.1 | No | UniRepLKNet: A Universal Perception Large-Kernel... | 2023-11-27 | Code |
| 139 | DPT-Hybrid | 49.02 | No | Vision Transformers for Dense Prediction | 2021-03-24 | Code |
| 140 | GC ViT-B | 49 | No | Global Context Vision Transformers | 2022-06-20 | Code |
| 141 | A2MIM (ViT-B) | 49 | No | Architecture-Agnostic Masked Image Modeling -- F... | 2022-05-27 | Code |
| 142 | EfficientViT-B3 (r512) | 49 | No | EfficientViT: Multi-Scale Linear Attention for H... | 2022-05-29 | Code |
| 143 | DiNAT-Tiny (UperNet) | 48.8 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 144 | HRViT-b2 (SegFormer, SS) | 48.76 | No | Multi-Scale High-Resolution Vision Transformer f... | 2021-11-01 | Code |
| 145 | NAT-Tiny | 48.4 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 146 | XCiT-M24/8 (UperNet) | 48.4 | No | XCiT: Cross-Covariance Image Transformers | 2021-06-17 | Code |
| 147 | ResNeSt-200 | 48.36 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 148 | DAT-S (UperNet) | 48.31 | No | Vision Transformer with Deformable Attention | 2022-01-03 | Code |
| 149 | GC ViT-S | 48.3 | No | Global Context Vision Transformers | 2022-06-20 | Code |
| 150 | InternImage-T | 48.1 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 151 | VAN-Large | 48.1 | No | Visual Attention Network | 2022-02-20 | Code |
| 152 | XCiT-S24/8 (UperNet) | 48.1 | No | XCiT: Cross-Covariance Image Transformers | 2021-06-17 | Code |
| 153 | MaskFormer(ResNet-101) | 48.1 | No | Per-Pixel Classification is Not All You Need for... | 2021-07-13 | Code |
| 154 | MAE (ViT-B, UperNet) | 48.1 | No | Masked Autoencoders Are Scalable Vision Learners | 2021-11-11 | Code |
| 155 | HRNetV2 + OCR + RMI (PaddleClas pretrained) | 47.98 | No | Segmentation Transformer: Object-Contextual Repr... | 2019-09-24 | Code |
| 156 | Shift-B | 47.9 | No | When Shift Operation Meets Vision Transformer: A... | 2022-01-26 | Code |
| 157 | Shift-S | 47.8 | No | When Shift Operation Meets Vision Transformer: A... | 2022-01-26 | Code |
| 158 | MogaNet-S (Semantic FPN) | 47.7 | No | MogaNet: Multi-order Gated Aggregation Network | 2022-11-07 | Code |
| 159 | SeMask (SeMask Swin-S FPN) | 47.63 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 160 | ResNeSt-269 | 47.6 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 161 | UperNet Shuffle-T | 47.6 | No | Shuffle Transformer: Rethinking Spatial Shuffle ... | 2021-06-07 | Code |
| 162 | CondNet(ResNest-101) | 47.54 | No | CondNet: Conditional Classifier for Scene Segmen... | 2021-09-21 | Code |
| 163 | tiny-MOAT-3 (IN-1K pretraining, single scale) | 47.5 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 164 | CondNet(ResNet-101) | 47.38 | No | CondNet: Conditional Classifier for Scene Segmen... | 2021-09-21 | Code |
| 165 | DiNAT-Mini (UperNet) | 47.2 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 166 | DCNAS | 47.12 | No | DCNAS: Densely Connected Neural Architecture Sea... | 2020-03-26 | - |
| 167 | XCiT-S24/8 (Semantic-FPN) | 47.1 | No | XCiT: Cross-Covariance Image Transformers | 2021-06-17 | Code |
| 168 | ResNeSt-101 | 46.91 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 169 | XCiT-M24/8 (Semantic-FPN) | 46.9 | No | XCiT: Cross-Covariance Image Transformers | 2021-06-17 | Code |
| 170 | HamNet (ResNet-101) | 46.8 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 171 | Sequential Ensemble (DeepLabv3+) | 46.8 | No | Sequential Ensembling for Semantic Segmentation | 2022-10-08 | - |
| 172 | ConvNeXt-T | 46.7 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 173 | VAN-Base (Semantic-FPN) | 46.7 | No | Visual Attention Network | 2022-02-20 | Code |
| 174 | XCiT-S12/8 (UperNet) | 46.6 | No | XCiT: Cross-Covariance Image Transformers | 2021-06-17 | Code |
| 175 | GC ViT-T | 46.5 | No | Global Context Vision Transformers | 2022-06-20 | Code |
| 176 | NAT-Mini | 46.4 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 177 | Shift-T | 46.3 | No | When Shift Operation Meets Vision Transformer: A... | 2022-01-26 | Code |
| 178 | DaViT-T | 46.3 | No | DaViT: Dual Attention Vision Transformers | 2022-04-07 | Code |
| 179 | CPN(ResNet-101) | 46.27 | No | Context Prior for Scene Segmentation | 2020-04-03 | Code |
| 180 | MultiMAE (ViT-B) | 46.2 | No | MultiMAE: Multi-modal Multi-task Masked Autoenco... | 2022-04-04 | Code |
| 181 | DRAN(ResNet-101) | 46.18 | No | - | - | Code |
| 182 | PyConvSegNet-152 | 45.99 | No | Pyramidal Convolution: Rethinking Convolutional ... | 2020-06-20 | Code |
| 183 | DNL | 45.97 | No | Disentangled Non-Local Neural Networks | 2020-06-11 | Code |
| 184 | ACNet (ResNet-101) | 45.9 | No | Adaptive Context Network for Scene Parsing | 2019-11-05 | - |
| 185 | ACNet (ResNet-101) | 45.9 | No | Adaptive Context Network for Scene Parsing | 2019-11-05 | - |
| 186 | HRViT-b1 (SegFormer, SS) | 45.88 | No | Multi-Scale High-Resolution Vision Transformer f... | 2021-11-01 | Code |
| 187 | OCR(HRNetV2-W48) | 45.66 | No | Segmentation Transformer: Object-Contextual Repr... | 2019-09-24 | Code |
| 188 | SPNet (ResNet-101) | 45.6 | No | Strip Pooling: Rethinking Spatial Pooling for Sc... | 2020-03-30 | Code |
| 189 | Swin-T (UPerNet) MoBY | 45.58 | No | Self-Supervised Learning with Swin Transformers | 2021-05-10 | Code |
| 190 | DAT-T (UperNet) | 45.54 | No | Vision Transformer with Deformable Attention | 2022-01-03 | Code |
| 191 | iBOT (ViT-S/16) | 45.4 | No | iBOT: Image BERT Pre-Training with Online Tokeni... | 2021-11-15 | Code |
| 192 | EANet (ResNet-101) | 45.33 | No | Beyond Self-attention: External Attention using ... | 2021-05-05 | Code |
| 193 | OCR (ResNet-101) | 45.28 | No | Segmentation Transformer: Object-Contextual Repr... | 2019-09-24 | Code |
| 194 | Asymmetric ALNN | 45.24 | No | Asymmetric Non-local Neural Networks for Semanti... | 2019-08-21 | Code |
| 195 | Light-Ham (VAN-Small, D=256) | 45.2 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 196 | LaU-regression-loss | 45.02 | No | Location-aware Upsampling for Semantic Segmentat... | 2019-11-13 | Code |
| 197 | PSPNet | 44.94 | No | Pyramid Scene Parsing Network | 2016-12-04 | Code |
| 198 | tiny-MOAT-2 (IN-1K pretraining, single scale) | 44.9 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 199 | CFNet(ResNet-101) | 44.89 | No | - | - | Code |
| 200 | EncNet | 44.65 | No | Context Encoding for Semantic Segmentation | 2018-03-23 | Code |
| 201 | LaU-offset-loss | 44.55 | No | Location-aware Upsampling for Semantic Segmentat... | 2019-11-13 | Code |
| 202 | EncNet + JPU | 44.34 | No | FastFCN: Rethinking Dilated Convolution in the B... | 2019-03-28 | Code |
| 203 | SGR (ResNet-101) | 44.32 | No | - | - | Code |
| 204 | XCiT-S12/8 (Semantic-FPN) | 44.2 | No | XCiT: Cross-Covariance Image Transformers | 2021-06-17 | Code |
| 205 | Auto-DeepLab-L | 43.98 | No | Auto-DeepLab: Hierarchical Neural Architecture S... | 2019-01-10 | Code |
| 206 | PSANet (ResNet-101) | 43.77 | No | - | - | Code |
| 207 | DSSPN (ResNet-101) | 43.68 | No | Dynamic-structured Semantic Propagation Network | 2018-03-16 | - |
| 208 | PSPNet (ResNet-152) | 43.51 | No | Pyramid Scene Parsing Network | 2016-12-04 | Code |
| 209 | PSPNet (ResNet-101) | 43.29 | No | Pyramid Scene Parsing Network | 2016-12-04 | Code |
| 210 | HRNetV2 | 43.2 | No | High-Resolution Representations for Labeling Pix... | 2019-04-09 | Code |
| 211 | SeMask (SeMask Swin-T FPN) | 43.16 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 212 | tiny-MOAT-1 (IN-1K pretraining, single scale) | 43.1 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 213 | VAN-Small | 42.9 | No | Visual Attention Network | 2022-02-20 | Code |
| 214 | PoolFormer-M48 | 42.7 | No | MetaFormer Is Actually What You Need for Vision | 2021-11-22 | Code |
| 215 | UperNet (ResNet-101) | 42.66 | No | Unified Perceptual Parsing for Scene Understanding | 2018-07-26 | Code |
| 216 | tiny-MOAT-0 (IN-1K pretraining, single scale) | 41.2 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 217 | RefineNet | 40.7 | No | RefineNet: Multi-Path Refinement Networks for Hi... | 2016-11-20 | Code |
| 218 | FBNetV5 | 40.4 | No | FBNetV5: Neural Architecture Search for Multiple... | 2021-11-19 | - |
| 219 | ConvMLP-L | 40 | No | ConvMLP: Hierarchical Convolutional MLPs for Vis... | 2021-09-09 | Code |
| 220 | ConvMLP-M | 38.6 | No | ConvMLP: Hierarchical Convolutional MLPs for Vis... | 2021-09-09 | Code |
| 221 | VAN-Tiny | 38.5 | No | Visual Attention Network | 2022-02-20 | Code |
| 222 | A2MIM (ResNet-50) | 38.3 | No | Architecture-Agnostic Masked Image Modeling -- F... | 2022-05-27 | Code |
| 223 | iBOT (ViT-B/16) (linear head) | 38.3 | No | iBOT: Image BERT Pre-Training with Online Tokeni... | 2021-11-15 | Code |
| 224 | SegFormer-B0 | 37.4 | Yes | SegFormer: Simple and Efficient Design for Seman... | 2021-05-31 | Code |
| 225 | MUXNet-m + PPM | 35.8 | No | MUXConv: Information Multiplexing in Convolution... | 2020-03-31 | Code |
| 226 | ConvMLP-S | 35.8 | No | ConvMLP: Hierarchical Convolutional MLPs for Vis... | 2021-09-09 | Code |
| 227 | MUXNet-m + C1 | 32.42 | No | MUXConv: Information Multiplexing in Convolution... | 2020-03-31 | Code |
| 228 | DilatedNet | 32.31 | No | Multi-Scale Context Aggregation by Dilated Convo... | 2015-11-23 | Code |
| 229 | FCN | 29.39 | Yes | Fully Convolutional Networks for Semantic Segmen... | 2014-11-14 | Code |
| 230 | SegNet | 21.64 | No | SegNet: A Deep Convolutional Encoder-Decoder Arc... | 2015-11-02 | Code |