| Rank | Model | Params (M) | Extra Training Data | Paper | Date | Code |
|------|-------|------------|---------------------|-------|------|------|
| 1 | FD-SwinV2-G | 3000 | No | Contrastive Learning Rivals Masked Image Modelin... | 2022-05-27 | Code |
| 2 | RevCol-H (Mask2Former) | 2439 | Yes | Reversible Column Networks | 2022-12-22 | Code |
| 3 | BEiT-3 | 1900 | Yes | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 4 | ViT-P (InternImage-H) | 1610 | Yes | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 5 | ONE-PEACE | 1500 | Yes | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 6 | ViT-P (OneFormer, InternImage-H) | 1400 | No | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 7 | InternImage-H | 1310 | Yes | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 8 | M3I Pre-training (InternImage-H) | 1310 | Yes | Towards All-in-one Pre-training via Maximizing M... | 2022-11-17 | Code |
| 9 | InternImage-H (M3I Pre-training) | 1310 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 10 | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2Former) | 1080 | No | DINOv2: Learning Robust Visual Features without ... | 2023-04-14 | Code |
| 11 | EVA | 1074 | Yes | EVA: Exploring the Limits of Masked Visual Repre... | 2022-11-14 | Code |
| 12 | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) | 571 | Yes | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 13 | ViT-Adapter-L (Mask2Former, BEiT pretrain) | 571 | Yes | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 14 | MOAT-4 (IN-22K pretraining, single-scale) | 496 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 15 | ViT-Adapter-L (UperNet, BEiT pretrain) | 451 | No | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 16 | ConvNeXt-XL++ | 391 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 17 | InternImage-XL | 368 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 18 | RSSeg-ViT-L (BEiT pretrain) | 330 | No | Representation Separation for Semantic Segmentat... | 2022-12-28 | - |
| 19 | EoMT (DINOv2-L, single-scale, 512x512) | 316 | No | Your ViT is Secretly an Image Segmentation Model | 2025-03-24 | Code |
| 20 | ViT-P (OneFormer, DiNAT-L) | 309 | No | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 21 | InternImage-L | 256 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 22 | ConvNeXt-L++ | 235 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 23 | Mask DINO (SwinL, multi-scale) | 223 | Yes | Mask DINO: Towards A Unified Transformer-based F... | 2022-06-06 | Code |
| 24 | Sequential Ensemble (SegFormer) | 216.3 | No | Sequential Ensembling for Semantic Segmentation | 2022-10-08 | - |
| 25 | LV-ViT-L (UperNet, MS) | 209 | No | All Tokens Matter: Token Labeling for Training B... | 2021-04-22 | Code |
| 26 | DDP (Swin-L, step-3) | 207 | No | DDP: Diffusion Model for Dense Visual Prediction | 2023-03-30 | Code |
| 27 | MOAT-3 (IN-22K pretraining, single-scale) | 198 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 28 | InternImage-B | 128 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 29 | GC ViT-B | 125 | No | Global Context Vision Transformers | 2022-06-20 | Code |
| 30 | NAT-Base | 123 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 31 | ConvNeXt-B++ | 122 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 32 | ConvNeXt-B | 122 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 33 | DAT-B (UperNet) | 121 | No | Vision Transformer with Deformable Attention | 2022-01-03 | Code |
| 34 | TransNeXt-Base (IN-1K pretrain, Mask2Former, 512) | 109 | No | TransNeXt: Robust Foveal Visual Perception for V... | 2023-11-28 | Code |
| 35 | ActiveMLP-L (UperNet) | 108 | No | Active Token Mixer | 2022-03-11 | Code |
| 36 | SeMask (SeMask Swin-B FPN) | 96 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 37 | SegFormer-B5 | 84.7 | Yes | SegFormer: Simple and Efficient Design for Seman... | 2021-05-31 | Code |
| 38 | GC ViT-S | 84 | No | Global Context Vision Transformers | 2022-06-20 | Code |
| 39 | ConvNeXt-S | 82 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 40 | NAT-Small | 82 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 41 | MOAT-2 (IN-22K pretraining, single-scale) | 81 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 42 | DAT-S (UperNet) | 81 | No | Vision Transformer with Deformable Attention | 2022-01-03 | Code |
| 43 | InternImage-S | 80 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 44 | TransNeXt-Small (IN-1K pretrain, Mask2Former, 512) | 69 | No | TransNeXt: Robust Foveal Visual Perception for V... | 2023-11-28 | Code |
| 45 | SegFormer-B4 | 64.1 | Yes | SegFormer: Simple and Efficient Design for Seman... | 2021-05-31 | Code |
| 46 | Light-Ham (VAN-Huge) | 61.1 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 47 | ConvNeXt-T | 60 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 48 | DAT-T (UperNet) | 60 | No | Vision Transformer with Deformable Attention | 2022-01-03 | Code |
| 49 | InternImage-T | 59 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 50 | NAT-Tiny | 58 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 51 | GC ViT-T | 58 | No | Global Context Vision Transformers | 2022-06-20 | Code |
| 52 | SeMask (SeMask Swin-S FPN) | 56 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 53 | VAN-Large (HamNet) | 55 | No | Visual Attention Network | 2022-02-20 | Code |
| 54 | NAT-Mini | 50 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 55 | VAN-Large | 49 | No | Visual Attention Network | 2022-02-20 | Code |
| 56 | TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512) | 47.5 | No | TransNeXt: Robust Foveal Visual Perception for V... | 2023-11-28 | Code |
| 57 | Light-Ham (VAN-Large) | 45.6 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 58 | SeMask (SeMask Swin-T FPN) | 35 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 59 | HRViT-b3 (SegFormer, SS) | 28.7 | No | Multi-Scale High-Resolution Vision Transformer f... | 2021-11-01 | Code |
| 60 | Light-Ham (VAN-Base) | 27.4 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 61 | tiny-MOAT-3 (IN-1K pretraining, single-scale) | 24 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 62 | HRViT-b2 (SegFormer, SS) | 20.8 | No | Multi-Scale High-Resolution Vision Transformer f... | 2021-11-01 | Code |
| 63 | VAN-Small | 18 | No | Visual Attention Network | 2022-02-20 | Code |
| 64 | Light-Ham (VAN-Small, D=256) | 13.8 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 65 | tiny-MOAT-2 (IN-1K pretraining, single-scale) | 13 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 66 | HRViT-b1 (SegFormer, SS) | 8.2 | No | Multi-Scale High-Resolution Vision Transformer f... | 2021-11-01 | Code |
| 67 | tiny-MOAT-1 (IN-1K pretraining, single-scale) | 8 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 68 | VAN-Tiny | 8 | No | Visual Attention Network | 2022-02-20 | Code |
| 69 | tiny-MOAT-0 (IN-1K pretraining, single-scale) | 6 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 70 | SegFormer-B0 | 3.8 | Yes | SegFormer: Simple and Efficient Design for Seman... | 2021-05-31 | Code |