Semantic Segmentation on ADE20K

Metric: Validation mIoU (higher is better)
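
For readers unfamiliar with the metric, the snippet below is a minimal sketch of how mean Intersection-over-Union (mIoU) is typically computed from predicted and ground-truth label maps. It is not the evaluation code behind any entry on this leaderboard. The function name `mean_iou`, the `ignore_index=255` convention for unlabeled pixels, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration. The result is scaled to a percentage to match how scores are reported here.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Return mIoU in percent for integer label arrays of equal shape.

    Assumptions (illustrative, not taken from the leaderboard):
    - labels are integers in [0, num_classes)
    - pixels whose ground truth equals ignore_index are excluded
    - classes with an empty union are left out of the mean
    """
    pred = np.asarray(pred).ravel().astype(np.int64)
    target = np.asarray(target).ravel().astype(np.int64)

    # Drop unlabeled pixels before counting.
    keep = target != ignore_index
    pred, target = pred[keep], target[keep]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection

    valid = union > 0
    if not valid.any():
        return float("nan")
    return float((intersection[valid] / union[valid]).mean() * 100.0)


# Toy example with two classes: IoU is 1/2 for class 0 and 2/3 for class 1.
print(round(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2), 2))  # 58.33
```

In practice the confusion matrix is accumulated over the whole validation set before the per-class IoU is taken, rather than averaging per-image scores.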

Results

#	Model	Validation mIoU	Extra Data	Paper	Date	Code
1	ViT-P (InternImage-H)	63.6	Yes	The Missing Point in Vision Transformers for Uni...	2025-05-26	Code
2	ONE-PEACE	63	Yes	ONE-PEACE: Exploring One General Representation ...	2023-05-18	Code
3	InternImage-H	62.9	Yes	InternImage: Exploring Large-Scale Vision Founda...	2022-11-10	Code
4	M3I Pre-training (InternImage-H)	62.9	Yes	Towards All-in-one Pre-training via Maximizing M...	2022-11-17	Code
5	BEiT-3	62.8	Yes	Image as a Foreign Language: BEiT Pretraining fo...	2022-08-22	Code
6	EVA	62.3	Yes	EVA: Exploring the Limits of Masked Visual Repre...	2022-11-14	Code
7	ViT-P (OneFormer, InternImage-H)	61.6	No	The Missing Point in Vision Transformers for Uni...	2025-05-26	Code
8	ViT-Adapter-L (Mask2Former, BEiTv2 pretrain)	61.5	Yes	Vision Transformer Adapter for Dense Predictions	2022-05-17	Code
9	FD-SwinV2-G	61.4	No	Contrastive Learning Rivals Masked Image Modelin...	2022-05-27	Code
10	RevCol-H (Mask2Former)	61	Yes	Reversible Column Networks	2022-12-22	Code
11	Mask DINO (SwinL, multi-scale)	60.8	Yes	Mask DINO: Towards A Unified Transformer-based F...	2022-06-06	Code
12	ViT-Adapter-L (Mask2Former, BEiT pretrain)	60.5	Yes	Vision Transformer Adapter for Dense Predictions	2022-05-17	Code
13	DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former)	60.2	No	DINOv2: Learning Robust Visual Features without ...	2023-04-14	Code
14	ViT-P (OneFormer, DiNAT-L)	59.9	No	The Missing Point in Vision Transformers for Uni...	2025-05-26	Code
15	SwinV2-G(UperNet)	59.9	Yes	Swin Transformer V2: Scaling Up Capacity and Res...	2021-11-18	Code
16	PIIP-LH6B(UperNet)	59.9	No	Parameter-Inverted Image Pyramid Networks	2024-06-06	Code
17	SERNet-Former	59.35	No	SERNet-Former: Semantic Segmentation by Efficien...	2024-01-28	Code
18	FocalNet-L (Mask2Former)	58.5	Yes	Focal Modulation Networks	2022-03-22	Code
19	ViT-Adapter-L (UperNet, BEiT pretrain)	58.4	No	Vision Transformer Adapter for Dense Predictions	2022-05-17	Code
20	RSSeg-ViT-L (BEiT pretrain)	58.4	No	Representation Separation for Semantic Segmentat...	2022-12-28	-
21	EoMT (DINOv2-L, single-scale, 512x512)	58.4	No	Your ViT is Secretly an Image Segmentation Model	2025-03-24	Code
22	SegViT-v2 (BEiT-v2-Large)	58.2	No	SegViTv2: Exploring Efficient and Continual Sema...	2023-06-09	Code
23	SeMask (SeMask Swin-L FaPN-Mask2Former)	58.2	No	SeMask: Semantically Masked Transformers for Sem...	2021-12-23	Code
24	SeMask (SeMask Swin-L MSFaPN-Mask2Former)	58.2	No	SeMask: Semantically Masked Transformers for Sem...	2021-12-23	Code
25	DiNAT-L (Mask2Former)	58.1	No	Dilated Neighborhood Attention Transformer	2022-09-29	Code
26	HorNet-L (Mask2Former)	57.9	No	HorNet: Efficient High-Order Spatial Interaction...	2022-07-28	Code
27	Mask2Former (SwinL-FaPN)	57.7	No	Masked-attention Mask Transformer for Universal ...	2021-12-02	Code
28	FASeg (SwinL)	57.7	No	Dynamic Focus-aware Positional Queries for Seman...	2022-04-04	Code
29	RR (BEiT-L)	57.7	No	Region Rebalance for Long-Tailed Semantic Segmen...	2022-04-05	Code
30	MOAT-4 (IN-22K pretraining, single-scale)	57.6	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
31	Frozen Backbone, SwinV2-G-ext22K (Mask2Former)	57.6	No	Could Giant Pretrained Image Models Extract Univ...	2022-11-03	-
32	SeMask (SeMask Swin-L Mask2Former)	57.5	No	SeMask: Semantically Masked Transformers for Sem...	2021-12-23	Code
33	Mask2Former (SwinL)	57.3	No	Masked-attention Mask Transformer for Universal ...	2021-12-02	Code
34	SenFormer (BEiT-L)	57.1	Yes	Efficient Self-Ensemble for Semantic Segmentation	2021-11-26	Code
35	BEiT-L (ViT+UperNet)	57	No	BEiT: BERT Pre-Training of Image Transformers	2021-06-15	Code
36	SeMask(SeMask Swin-L MSFaPN-Mask2Former, single-scale)	57	No	SeMask: Semantically Masked Transformers for Sem...	2021-12-23	Code
37	MetaPrompt-SD	56.8	No	Harnessing Diffusion Models for Visual Perceptio...	2023-12-22	Code
38	FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain)	56.7	No	FaPN: Feature-aligned Pyramid Network for Dense ...	2021-08-16	Code
39	MOAT-3 (IN-22K pretraining, single-scale)	56.5	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
40	Mask2Former (Swin-L-FaPN)	56.4	No	Masked-attention Mask Transformer for Universal ...	2021-12-02	Code
41	SeMask (SeMask Swin-L MaskFormer)	56.2	No	SeMask: Semantically Masked Transformers for Sem...	2021-12-23	Code
42	dBOT ViT-L (CLIP)	56.2	No	Exploring Target Representations for Masked Auto...	2022-09-08	Code
43	Mask2Former+CBL(Swin-B)	56.1	No	-	-	Code
44	TADP	55.9	No	Text-image Alignment for Diffusion-based Percept...	2023-09-29	Code
45	CSWin-L (UperNet, ImageNet-22k pretrain)	55.7	No	CSWin Transformer: A General Vision Transformer ...	2021-07-01	Code
46	UniRepLKNet-XL	55.6	No	UniRepLKNet: A Universal Perception Large-Kernel...	2023-11-27	Code
47	Focal-L (UperNet, ImageNet-22k pretrain)	55.4	No	Focal Self-attention for Local-Global Interactio...	2021-07-01	Code
48	InternImage-XL	55.3	No	InternImage: Exploring Large-Scale Vision Founda...	2022-11-10	Code
49	dBOT ViT-L	55.2	No	Exploring Target Representations for Masked Auto...	2022-09-08	Code
50	Mask2Former(Swin-B)	55.1	No	Masked-attention Mask Transformer for Universal ...	2021-12-02	Code
51	ConvNeXt V2-H (FCMAE)	55	No	ConvNeXt V2: Co-designing and Scaling ConvNets w...	2023-01-02	Code
52	UniRepLKNet-L++	55	No	UniRepLKNet: A Universal Perception Large-Kernel...	2023-11-27	Code
53	DiNAT-Large (UperNet)	54.9	No	Dilated Neighborhood Attention Transformer	2022-09-29	Code
54	MaskFormer+CBL(Swin-B)	54.9	No	-	-	Code
55	TransNeXt-Base (IN-1K pretrain, Mask2Former, 512)	54.7	No	TransNeXt: Robust Foveal Visual Perception for V...	2023-11-28	Code
56	MOAT-2 (IN-22K pretraining, single-scale)	54.7	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
57	CAE (ViT-L, UperNet)	54.7	No	Context Autoencoder for Self-Supervised Represen...	2022-02-07	Code
58	VAN-B6	54.7	No	Visual Attention Network	2022-02-20	Code
59	DiNAT_s-Large (UperNet)	54.6	No	Dilated Neighborhood Attention Transformer	2022-09-29	Code
60	DDP (Swin-L, step-3)	54.4	No	DDP: Diffusion Model for Dense Visual Prediction	2023-03-30	Code
61	PatchDiverse + Swin-L (multi-scale test, upernet, ImageNet22k pretrain)	54.4	No	Vision Transformers with Patch Diversification	2021-04-26	Code
62	VOLO-D5	54.3	No	VOLO: Vision Outlooker for Visual Recognition	2021-06-24	Code
63	K-Net	54.3	No	K-Net: Towards Unified Image Segmentation	2021-06-28	Code
64	GPaCo (Swin-L)	54.3	No	Generalized Parametric Contrastive Learning	2022-09-26	Code
65	SenFormer (Swin-L)	54.2	Yes	Efficient Self-Ensemble for Semantic Segmentation	2021-11-26	Code
66	Swin V2-H	54.2	No	ConvNeXt V2: Co-designing and Scaling ConvNets w...	2023-01-02	Code
67	InternImage-L	54.1	No	InternImage: Exploring Large-Scale Vision Founda...	2022-11-10	Code
68	TransNeXt-Small (IN-1K pretrain, Mask2Former, 512)	54.1	No	TransNeXt: Robust Foveal Visual Perception for V...	2023-11-28	Code
69	ConvNeXt-XL++	54	No	A ConvNet for the 2020s	2022-01-10	Code
70	Sequential Ensemble (SegFormer)	54	No	Sequential Ensembling for Semantic Segmentation	2022-10-08	-
71	MogaNet-XL (UperNet)	54	No	MogaNet: Multi-order Gated Aggregation Network	2022-11-07	Code
72	UniRepLKNet-B++	53.9	No	UniRepLKNet: A Universal Perception Large-Kernel...	2023-11-27	Code
73	MaskFormer(Swin-B)	53.8	No	Per-Pixel Classification is Not All You Need for...	2021-07-13	Code
74	ConvNeXt-L++	53.7	No	A ConvNet for the 2020s	2022-01-10	Code
75	SwinV2-G-HTC++ (Liu et al., 2021a)	53.7	No	Swin Transformer V2: Scaling Up Capacity and Res...	2021-11-18	Code
76	ConvNeXt V2-L	53.7	No	ConvNeXt V2: Co-designing and Scaling ConvNets w...	2023-01-02	Code
77	Seg-L-Mask/16 (MS)	53.63	No	Segmenter: Transformer for Semantic Segmentation	2021-05-12	Code
78	MAE (ViT-L, UperNet)	53.6	No	Masked Autoencoders Are Scalable Vision Learners	2021-11-11	Code
79	SeMask (SeMask Swin-L FPN)	53.52	No	SeMask: Semantically Masked Transformers for Sem...	2021-12-23	Code
80	Swin-L (UperNet, ImageNet-22k pretrain)	53.5	No	Swin Transformer: Hierarchical Vision Transforme...	2021-03-25	Code
81	Swin-L	53.5	No	ConvNeXt V2: Co-designing and Scaling ConvNets w...	2023-01-02	Code
82	TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512)	53.4	No	TransNeXt: Robust Foveal Visual Perception for V...	2023-11-28	Code
83	ConvNeXt-B++	53.1	No	A ConvNet for the 2020s	2022-01-10	Code
84	PatchConvNet-L120 (UperNet)	52.9	No	Augmenting Convolutional networks with attention...	2021-12-27	Code
85	dBOT ViT-B (CLIP)	52.9	No	Exploring Target Representations for Masked Auto...	2022-09-08	Code
86	PatchConvNet-B120 (UperNet)	52.8	No	Augmenting Convolutional networks with attention...	2021-12-27	Code
87	Swin-B	52.8	No	ConvNeXt V2: Co-designing and Scaling ConvNets w...	2023-01-02	Code
88	UniRepLKNet-S++	52.7	No	UniRepLKNet: A Universal Perception Large-Kernel...	2023-11-27	Code
89	ConvNeXt V2-B	52.1	No	ConvNeXt V2: Co-designing and Scaling ConvNets w...	2023-01-02	Code
90	DeBiFormer-B (IN1k pretrain, Upernet 160k)	52	No	DeBiFormer: Vision Transformer with Deformable A...	2024-10-11	Code
91	LV-ViT-L (UperNet, MS)	51.8	No	All Tokens Matter: Token Labeling for Training B...	2021-04-22	Code
92	SegFormer-B5	51.8	Yes	SegFormer: Simple and Efficient Design for Seman...	2021-05-31	Code
93	BiFormer-B (IN1k pretrain, Upernet 160k)	51.7	No	BiFormer: Vision Transformer with Bi-Level Routi...	2023-03-15	Code
94	ConvNeXt V2-L (Supervised)	51.6	No	ConvNeXt V2: Co-designing and Scaling ConvNets w...	2023-01-02	Code
95	Light-Ham (VAN-Huge)	51.5	No	Is Attention Better Than Matrix Decomposition?	2021-09-09	Code
96	DAT-B++	51.5	No	DAT++: Spatially Dynamic Vision Transformer with...	2023-09-04	Code
97	CrossFormer (ImageNet1k-pretrain, UPerNet, multi-scale test)	51.4	No	CrossFormer: A Versatile Vision Transformer Hing...	2021-07-31	Code
98	InternImage-B	51.3	No	InternImage: Exploring Large-Scale Vision Founda...	2022-11-10	Code
99	DAT-S++	51.2	No	DAT++: Spatially Dynamic Vision Transformer with...	2023-09-04	Code
100	ActiveMLP-L(UperNet)	51.1	No	Active Token Mixer	2022-03-11	Code
101	SegFormer-B4	51.1	Yes	SegFormer: Simple and Efficient Design for Seman...	2021-05-31	Code
102	PatchConvNet-B60 (UperNet)	51.1	No	Augmenting Convolutional networks with attention...	2021-12-27	Code
103	Light-Ham (VAN-Large)	51	No	Is Attention Better Than Matrix Decomposition?	2021-09-09	Code
104	TEC (Vit-B, Upernet)	51	No	Towards Sustainable Self-supervised Learning	2022-10-20	Code
105	UniRepLKNet-S	51	No	UniRepLKNet: A Universal Perception Large-Kernel...	2023-11-27	Code
106	SeMask (SeMask Swin-B FPN)	50.98	No	SeMask: Semantically Masked Transformers for Sem...	2021-12-23	Code
107	InternImage-S	50.9	No	InternImage: Exploring Large-Scale Vision Founda...	2022-11-10	Code
108	MogaNet-L (UperNet)	50.9	No	MogaNet: Multi-order Gated Aggregation Network	2022-11-07	Code
109	dBOT ViT-B	50.8	No	Exploring Target Representations for Masked Auto...	2022-09-08	Code
110	Upernet-BiFormer-S (IN1k pretrain, Upernet 160k)	50.8	No	BiFormer: Vision Transformer with Bi-Level Routi...	2023-03-15	Code
111	UperNet Shuffle-B	50.5	No	Shuffle Transformer: Rethinking Spatial Shuffle ...	2021-06-07	Code
112	ConvNeXt V1-L	50.5	No	ConvNeXt V2: Co-designing and Scaling ConvNets w...	2023-01-02	Code
113	DiNAT-Base (UperNet)	50.4	No	Dilated Neighborhood Attention Transformer	2022-09-29	Code
114	ELSA-Swin-S	50.3	No	ELSA: Enhanced Local Self-Attention for Vision T...	2021-12-23	Code
115	DAT-T++	50.3	No	DAT++: Spatially Dynamic Vision Transformer with...	2023-09-04	Code
116	SETR-MLA (160k, MS)	50.28	No	Rethinking Semantic Segmentation from a Sequence...	2020-12-31	Code
117	VAN-Large (HamNet)	50.2	No	Visual Attention Network	2022-02-20	Code
118	HRViT-b3 (SegFormer, SS)	50.2	No	Multi-Scale High-Resolution Vision Transformer f...	2021-11-01	Code
119	Twins-SVT-L (UperNet, ImageNet-1k pretrain)	50.2	No	Twins: Revisiting the Design of Spatial Attentio...	2021-04-28	Code
120	MogaNet-B (UperNet)	50.1	No	MogaNet: Multi-order Gated Aggregation Network	2022-11-07	Code
121	Seg-B-Mask/16(MS, ViT-B)	50	No	Segmenter: Transformer for Semantic Segmentation	2021-05-12	Code
122	iBOT (ViT-B/16)	50	No	iBOT: Image BERT Pre-Training with Online Tokeni...	2021-11-15	Code
123	ConvNeXt-B	49.9	No	A ConvNet for the 2020s	2022-01-10	Code
124	DiNAT-Small (UperNet)	49.9	No	Dilated Neighborhood Attention Transformer	2022-09-29	Code
125	ConvNeXt V1-B	49.9	No	ConvNeXt V2: Co-designing and Scaling ConvNets w...	2023-01-02	Code
126	NAT-Base	49.7	No	Neighborhood Attention Transformer	2022-04-14	Code
127	Swin-B (UperNet, ImageNet-1k pretrain)	49.7	No	Swin Transformer: Hierarchical Vision Transforme...	2021-03-25	Code
128	Seg-B/8 (MS, ViT-B)	49.61	No	Segmenter: Transformer for Semantic Segmentation	2021-05-12	Code
129	ConvNeXt-S	49.6	No	A ConvNet for the 2020s	2022-01-10	Code
130	Light-Ham (VAN-Base)	49.6	No	Is Attention Better Than Matrix Decomposition?	2021-09-09	Code
131	NAT-Small	49.5	No	Neighborhood Attention Transformer	2022-04-14	Code
132	DaViT-B	49.4	No	DaViT: Dual Attention Vision Transformers	2022-04-07	Code
133	DAT-B (UperNet)	49.38	No	Vision Transformer with Deformable Attention	2022-01-03	Code
134	PatchConvNet-S60 (UperNet)	49.3	No	Augmenting Convolutional networks with attention...	2021-12-27	Code
135	ColorMAE-Green-ViTB-1600	49.3	No	ColorMAE: Exploring data-independent masking str...	2024-07-17	Code
136	MogaNet-S (UperNet)	49.2	No	MogaNet: Multi-order Gated Aggregation Network	2022-11-07	Code
137	Shift-B (UperNet)	49.2	No	When Shift Operation Meets Vision Transformer: A...	2022-01-26	Code
138	UniRepLKNet-T	49.1	No	UniRepLKNet: A Universal Perception Large-Kernel...	2023-11-27	Code
139	DPT-Hybrid	49.02	No	Vision Transformers for Dense Prediction	2021-03-24	Code
140	GC ViT-B	49	No	Global Context Vision Transformers	2022-06-20	Code
141	A2MIM (ViT-B)	49	No	Architecture-Agnostic Masked Image Modeling -- F...	2022-05-27	Code
142	EfficientViT-B3 (r512)	49	No	EfficientViT: Multi-Scale Linear Attention for H...	2022-05-29	Code
143	DiNAT-Tiny (UperNet)	48.8	No	Dilated Neighborhood Attention Transformer	2022-09-29	Code
144	HRViT-b2 (SegFormer, SS)	48.76	No	Multi-Scale High-Resolution Vision Transformer f...	2021-11-01	Code
145	NAT-Tiny	48.4	No	Neighborhood Attention Transformer	2022-04-14	Code
146	XCiT-M24/8 (UperNet)	48.4	No	XCiT: Cross-Covariance Image Transformers	2021-06-17	Code
147	ResNeSt-200	48.36	No	ResNeSt: Split-Attention Networks	2020-04-19	Code
148	DAT-S (UperNet)	48.31	No	Vision Transformer with Deformable Attention	2022-01-03	Code
149	GC ViT-S	48.3	No	Global Context Vision Transformers	2022-06-20	Code
150	InternImage-T	48.1	No	InternImage: Exploring Large-Scale Vision Founda...	2022-11-10	Code
151	VAN-Large	48.1	No	Visual Attention Network	2022-02-20	Code
152	XCiT-S24/8 (UperNet)	48.1	No	XCiT: Cross-Covariance Image Transformers	2021-06-17	Code
153	MaskFormer(ResNet-101)	48.1	No	Per-Pixel Classification is Not All You Need for...	2021-07-13	Code
154	MAE (ViT-B, UperNet)	48.1	No	Masked Autoencoders Are Scalable Vision Learners	2021-11-11	Code
155	HRNetV2 + OCR + RMI (PaddleClas pretrained)	47.98	No	Segmentation Transformer: Object-Contextual Repr...	2019-09-24	Code
156	Shift-B	47.9	No	When Shift Operation Meets Vision Transformer: A...	2022-01-26	Code
157	Shift-S	47.8	No	When Shift Operation Meets Vision Transformer: A...	2022-01-26	Code
158	MogaNet-S (Semantic FPN)	47.7	No	MogaNet: Multi-order Gated Aggregation Network	2022-11-07	Code
159	SeMask (SeMask Swin-S FPN)	47.63	No	SeMask: Semantically Masked Transformers for Sem...	2021-12-23	Code
160	ResNeSt-269	47.6	No	ResNeSt: Split-Attention Networks	2020-04-19	Code
161	UperNet Shuffle-T	47.6	No	Shuffle Transformer: Rethinking Spatial Shuffle ...	2021-06-07	Code
162	CondNet(ResNest-101)	47.54	No	CondNet: Conditional Classifier for Scene Segmen...	2021-09-21	Code
163	tiny-MOAT-3 (IN-1K pretraining, single scale)	47.5	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
164	CondNet(ResNet-101)	47.38	No	CondNet: Conditional Classifier for Scene Segmen...	2021-09-21	Code
165	DiNAT-Mini (UperNet)	47.2	No	Dilated Neighborhood Attention Transformer	2022-09-29	Code
166	DCNAS	47.12	No	DCNAS: Densely Connected Neural Architecture Sea...	2020-03-26	-
167	XCiT-S24/8 (Semantic-FPN)	47.1	No	XCiT: Cross-Covariance Image Transformers	2021-06-17	Code
168	ResNeSt-101	46.91	No	ResNeSt: Split-Attention Networks	2020-04-19	Code
169	XCiT-M24/8 (Semantic-FPN)	46.9	No	XCiT: Cross-Covariance Image Transformers	2021-06-17	Code
170	HamNet (ResNet-101)	46.8	No	Is Attention Better Than Matrix Decomposition?	2021-09-09	Code
171	Sequential Ensemble (DeepLabv3+)	46.8	No	Sequential Ensembling for Semantic Segmentation	2022-10-08	-
172	ConvNeXt-T	46.7	No	A ConvNet for the 2020s	2022-01-10	Code
173	VAN-Base (Semantic-FPN)	46.7	No	Visual Attention Network	2022-02-20	Code
174	XCiT-S12/8 (UperNet)	46.6	No	XCiT: Cross-Covariance Image Transformers	2021-06-17	Code
175	GC ViT-T	46.5	No	Global Context Vision Transformers	2022-06-20	Code
176	NAT-Mini	46.4	No	Neighborhood Attention Transformer	2022-04-14	Code
177	Shift-T	46.3	No	When Shift Operation Meets Vision Transformer: A...	2022-01-26	Code
178	DaViT-T	46.3	No	DaViT: Dual Attention Vision Transformers	2022-04-07	Code
179	CPN(ResNet-101)	46.27	No	Context Prior for Scene Segmentation	2020-04-03	Code
180	MultiMAE (ViT-B)	46.2	No	MultiMAE: Multi-modal Multi-task Masked Autoenco...	2022-04-04	Code
181	DRAN(ResNet-101)	46.18	No	-	-	Code
182	PyConvSegNet-152	45.99	No	Pyramidal Convolution: Rethinking Convolutional ...	2020-06-20	Code
183	DNL	45.97	No	Disentangled Non-Local Neural Networks	2020-06-11	Code
184	ACNet (ResNet-101)	45.9	No	Adaptive Context Network for Scene Parsing	2019-11-05	-
185	ACNet (ResNet-101)	45.9	No	Adaptive Context Network for Scene Parsing	2019-11-05	-
186	HRViT-b1 (SegFormer, SS)	45.88	No	Multi-Scale High-Resolution Vision Transformer f...	2021-11-01	Code
187	OCR(HRNetV2-W48)	45.66	No	Segmentation Transformer: Object-Contextual Repr...	2019-09-24	Code
188	SPNet (ResNet-101)	45.6	No	Strip Pooling: Rethinking Spatial Pooling for Sc...	2020-03-30	Code
189	Swin-T (UPerNet) MoBY	45.58	No	Self-Supervised Learning with Swin Transformers	2021-05-10	Code
190	DAT-T (UperNet)	45.54	No	Vision Transformer with Deformable Attention	2022-01-03	Code
191	iBOT (ViT-S/16)	45.4	No	iBOT: Image BERT Pre-Training with Online Tokeni...	2021-11-15	Code
192	EANet (ResNet-101)	45.33	No	Beyond Self-attention: External Attention using ...	2021-05-05	Code
193	OCR (ResNet-101)	45.28	No	Segmentation Transformer: Object-Contextual Repr...	2019-09-24	Code
194	Asymmetric ALNN	45.24	No	Asymmetric Non-local Neural Networks for Semanti...	2019-08-21	Code
195	Light-Ham (VAN-Small, D=256)	45.2	No	Is Attention Better Than Matrix Decomposition?	2021-09-09	Code
196	LaU-regression-loss	45.02	No	Location-aware Upsampling for Semantic Segmentat...	2019-11-13	Code
197	PSPNet	44.94	No	Pyramid Scene Parsing Network	2016-12-04	Code
198	tiny-MOAT-2 (IN-1K pretraining, single scale)	44.9	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
199	CFNet(ResNet-101)	44.89	No	-	-	Code
200	EncNet	44.65	No	Context Encoding for Semantic Segmentation	2018-03-23	Code
201	LaU-offset-loss	44.55	No	Location-aware Upsampling for Semantic Segmentat...	2019-11-13	Code
202	EncNet + JPU	44.34	No	FastFCN: Rethinking Dilated Convolution in the B...	2019-03-28	Code
203	SGR (ResNet-101)	44.32	No	-	-	Code
204	XCiT-S12/8 (Semantic-FPN)	44.2	No	XCiT: Cross-Covariance Image Transformers	2021-06-17	Code
205	Auto-DeepLab-L	43.98	No	Auto-DeepLab: Hierarchical Neural Architecture S...	2019-01-10	Code
206	PSANet (ResNet-101)	43.77	No	-	-	Code
207	DSSPN (ResNet-101)	43.68	No	Dynamic-structured Semantic Propagation Network	2018-03-16	-
208	PSPNet (ResNet-152)	43.51	No	Pyramid Scene Parsing Network	2016-12-04	Code
209	PSPNet (ResNet-101)	43.29	No	Pyramid Scene Parsing Network	2016-12-04	Code
210	HRNetV2	43.2	No	High-Resolution Representations for Labeling Pix...	2019-04-09	Code
211	SeMask (SeMask Swin-T FPN)	43.16	No	SeMask: Semantically Masked Transformers for Sem...	2021-12-23	Code
212	tiny-MOAT-1 (IN-1K pretraining, single scale)	43.1	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
213	VAN-Small	42.9	No	Visual Attention Network	2022-02-20	Code
214	PoolFormer-M48	42.7	No	MetaFormer Is Actually What You Need for Vision	2021-11-22	Code
215	UperNet (ResNet-101)	42.66	No	Unified Perceptual Parsing for Scene Understanding	2018-07-26	Code
216	tiny-MOAT-0 (IN-1K pretraining, single scale)	41.2	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
217	RefineNet	40.7	No	RefineNet: Multi-Path Refinement Networks for Hi...	2016-11-20	Code
218	FBNetV5	40.4	No	FBNetV5: Neural Architecture Search for Multiple...	2021-11-19	-
219	ConvMLP-L	40	No	ConvMLP: Hierarchical Convolutional MLPs for Vis...	2021-09-09	Code
220	ConvMLP-M	38.6	No	ConvMLP: Hierarchical Convolutional MLPs for Vis...	2021-09-09	Code
221	VAN-Tiny	38.5	No	Visual Attention Network	2022-02-20	Code
222	A2MIM (ResNet-50)	38.3	No	Architecture-Agnostic Masked Image Modeling -- F...	2022-05-27	Code
223	iBOT (ViT-B/16) (linear head)	38.3	No	iBOT: Image BERT Pre-Training with Online Tokeni...	2021-11-15	Code
224	SegFormer-B0	37.4	Yes	SegFormer: Simple and Efficient Design for Seman...	2021-05-31	Code
225	MUXNet-m + PPM	35.8	No	MUXConv: Information Multiplexing in Convolution...	2020-03-31	Code
226	ConvMLP-S	35.8	No	ConvMLP: Hierarchical Convolutional MLPs for Vis...	2021-09-09	Code
227	MUXNet-m + C1	32.42	No	MUXConv: Information Multiplexing in Convolution...	2020-03-31	Code
228	DilatedNet	32.31	No	Multi-Scale Context Aggregation by Dilated Convo...	2015-11-23	Code
229	FCN	29.39	Yes	Fully Convolutional Networks for Semantic Segmen...	2014-11-14	Code
230	SegNet	21.64	No	SegNet: A Deep Convolutional Encoder-Decoder Arc...	2015-11-02	Code

#1ViT-P (InternImage-H)SOTA
63.6
Validation mIoU· Extra Data· 2025-05-26
The Missing Point in Vision Transformers for Universal Image Segmentation Code
#2ONE-PEACESOTA
63
Validation mIoU· Extra Data· 2023-05-18
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities Code
#3InternImage-HSOTA
62.9
Validation mIoU· Extra Data· 2022-11-10
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions Code
#4M3I Pre-training (InternImage-H)
62.9
Validation mIoU· Extra Data· 2022-11-17
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information Code
#5BEiT-3SOTA
62.8
Validation mIoU· Extra Data· 2022-08-22
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks Code
#6EVA
62.3
Validation mIoU· Extra Data· 2022-11-14
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale Code
#7ViT-P (OneFormer, InternImage-H)
61.6
Validation mIoU· 2025-05-26
The Missing Point in Vision Transformers for Universal Image Segmentation Code
#8ViT-Adapter-L (Mask2Former, BEiTv2 pretrain)SOTA
61.5
Validation mIoU· Extra Data· 2022-05-17
Vision Transformer Adapter for Dense Predictions Code
#9FD-SwinV2-G
61.4
Validation mIoU· 2022-05-27
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation Code
#10RevCol-H (Mask2Former)
61
Validation mIoU· Extra Data· 2022-12-22
Reversible Column Networks Code
#11MasK DINO (SwinL, multi-scale)
60.8
Validation mIoU· Extra Data· 2022-06-06
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation Code
#12ViT-Adapter-L (Mask2Former, BEiT pretrain)
60.5
Validation mIoU· Extra Data· 2022-05-17
Vision Transformer Adapter for Dense Predictions Code
#13DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former)
60.2
Validation mIoU· 2023-04-14
DINOv2: Learning Robust Visual Features without Supervision Code
#14ViT-P (OneFormer, DiNAT-L)
59.9
Validation mIoU· 2025-05-26
The Missing Point in Vision Transformers for Universal Image Segmentation Code
#15SwinV2-G(UperNet)SOTA
59.9
Validation mIoU· Extra Data· 2021-11-18
Swin Transformer V2: Scaling Up Capacity and Resolution Code
#16PIIP-LH6B(UperNet)
59.9
Validation mIoU· 2024-06-06
Parameter-Inverted Image Pyramid Networks Code
#17SERNet-Former
59.35
Validation mIoU· 2024-01-28
SERNet-Former: Semantic Segmentation by Efficient Residual Network with Attention-Boosting Gates and Attention-Fusion Networks Code
#18FocalNet-L (Mask2Former)
58.5
Validation mIoU· Extra Data· 2022-03-22
Focal Modulation Networks Code
#19ViT-Adapter-L (UperNet, BEiT pretrain)
58.4
Validation mIoU· 2022-05-17
Vision Transformer Adapter for Dense Predictions Code
#20RSSeg-ViT-L (BEiT pretrain)
58.4
Validation mIoU· 2022-12-28
Representation Separation for Semantic Segmentation with Vision Transformers
#21EoMT (DINOv2-L, single-scale, 512x512)
58.4
Validation mIoU· 2025-03-24
Your ViT is Secretly an Image Segmentation Model Code
#22SegViT-v2 (BEiT-v2-Large)
58.2
Validation mIoU· 2023-06-09
SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers Code
#23SeMask (SeMask Swin-L FaPN-Mask2Former)
58.2
Validation mIoU· 2021-12-23
SeMask: Semantically Masked Transformers for Semantic Segmentation Code
#24SeMask (SeMask Swin-L MSFaPN-Mask2Former)
58.2
Validation mIoU· 2021-12-23
SeMask: Semantically Masked Transformers for Semantic Segmentation Code
#25DiNAT-L (Mask2Former)
58.1
Validation mIoU· 2022-09-29
Dilated Neighborhood Attention Transformer Code
#26HorNet-L (Mask2Former)
57.9
Validation mIoU· 2022-07-28
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions Code
#27Mask2Former (SwinL-FaPN)
57.7
Validation mIoU· 2021-12-02
Masked-attention Mask Transformer for Universal Image Segmentation Code
#28FASeg (SwinL)
57.7
Validation mIoU· 2022-04-04
Dynamic Focus-aware Positional Queries for Semantic Segmentation Code
#29RR (BEiT-L)
57.7
Validation mIoU· 2022-04-05
Region Rebalance for Long-Tailed Semantic Segmentation Code
#30MOAT-4 (IN-22K pretraining, single-scale)
57.6
Validation mIoU· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#31Frozen Backbone, SwinV2-G-ext22K (Mask2Former)
57.6
Validation mIoU· 2022-11-03
Could Giant Pretrained Image Models Extract Universal Representations?
#32SeMask (SeMask Swin-L Mask2Former)
57.5
Validation mIoU· 2021-12-23
SeMask: Semantically Masked Transformers for Semantic Segmentation Code
#33Mask2Former (SwinL)
57.3
Validation mIoU· 2021-12-02
Masked-attention Mask Transformer for Universal Image Segmentation Code
#34SenFormer (BEiT-L)
57.1
Validation mIoU· Extra Data· 2021-11-26
Efficient Self-Ensemble for Semantic Segmentation Code
#35BEiT-L (ViT+UperNet)SOTA
57
Validation mIoU· 2021-06-15
BEiT: BERT Pre-Training of Image Transformers Code
#36SeMask(SeMask Swin-L MSFaPN-Mask2Former, single-scale)
57
Validation mIoU· 2021-12-23
SeMask: Semantically Masked Transformers for Semantic Segmentation Code
#37MetaPrompt-SD
56.8
Validation mIoU· 2023-12-22
Harnessing Diffusion Models for Visual Perception with Meta Prompts Code
#38FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain)
56.7
Validation mIoU· 2021-08-16
FaPN: Feature-aligned Pyramid Network for Dense Image Prediction Code
#39MOAT-3 (IN-22K pretraining, single-scale)
56.5
Validation mIoU· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#40Mask2Former (Swin-L-FaPN)
56.4
Validation mIoU· 2021-12-02
Masked-attention Mask Transformer for Universal Image Segmentation Code
#41SeMask (SeMask Swin-L MaskFormer)
56.2
Validation mIoU· 2021-12-23
SeMask: Semantically Masked Transformers for Semantic Segmentation Code
#42dBOT ViT-L (CLIP)
56.2
Validation mIoU· 2022-09-08
Exploring Target Representations for Masked Autoencoders Code
#43Mask2Former+CBL(Swin-B)
56.1
Validation mIoU
No paperCode
#44TADP
55.9
Validation mIoU· 2023-09-29
Text-image Alignment for Diffusion-based Perception Code
#45CSWin-L (UperNet, ImageNet-22k pretrain)
55.7
Validation mIoU· 2021-07-01
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows Code
#46UniRepLKNet-XL
55.6
Validation mIoU· 2023-11-27
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition Code
#47Focal-L (UperNet, ImageNet-22k pretrain)
55.4
Validation mIoU· 2021-07-01
Focal Self-attention for Local-Global Interactions in Vision Transformers Code
#48InternImage-XL
55.3
Validation mIoU· 2022-11-10
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions Code
#49dBOT ViT-L
55.2
Validation mIoU· 2022-09-08
Exploring Target Representations for Masked Autoencoders Code
#50Mask2Former(Swin-B)
55.1
Validation mIoU· 2021-12-02
Masked-attention Mask Transformer for Universal Image Segmentation Code
#51ConvNeXt V2-H (FCMAE)
55
Validation mIoU· 2023-01-02
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders Code
#52UniRepLKNet-L++
55
Validation mIoU· 2023-11-27
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition Code
#53DiNAT-Large (UperNet)
54.9
Validation mIoU· 2022-09-29
Dilated Neighborhood Attention Transformer Code
#54MaskFormer+CBL(Swin-B)
54.9
Validation mIoU
No paperCode
#55TransNeXt-Base (IN-1K pretrain, Mask2Former, 512)
54.7
Validation mIoU· 2023-11-28
TransNeXt: Robust Foveal Visual Perception for Vision Transformers Code
#56MOAT-2 (IN-22K pretraining, single-scale)
54.7
Validation mIoU· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#57CAE (ViT-L, UperNet)
54.7
Validation mIoU· 2022-02-07
Context Autoencoder for Self-Supervised Representation Learning Code
#58VAN-B6
54.7
Validation mIoU· 2022-02-20
Visual Attention Network Code
#59DiNAT_s-Large (UperNet)
54.6
Validation mIoU· 2022-09-29
Dilated Neighborhood Attention Transformer Code
#60DDP (Swin-L, step-3)
54.4
Validation mIoU· 2023-03-30
DDP: Diffusion Model for Dense Visual Prediction Code
#61PatchDiverse + Swin-L (multi-scale test, upernet, ImageNet22k pretrain)SOTA
54.4
Validation mIoU· 2021-04-26
Vision Transformers with Patch Diversification Code
#62VOLO-D5
54.3
Validation mIoU· 2021-06-24
VOLO: Vision Outlooker for Visual Recognition Code
#63K-Net
54.3
Validation mIoU· 2021-06-28
K-Net: Towards Unified Image Segmentation Code
#64GPaCo (Swin-L)
54.3
Validation mIoU· 2022-09-26
Generalized Parametric Contrastive Learning Code
#65SenFormer (Swin-L)
54.2
Validation mIoU· Extra Data· 2021-11-26
Efficient Self-Ensemble for Semantic Segmentation Code
#66Swin V2-H
54.2
Validation mIoU· 2023-01-02
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders Code
#67InternImage-L
54.1
Validation mIoU· 2022-11-10
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions Code
#68TransNeXt-Small (IN-1K pretrain, Mask2Former, 512)
54.1
Validation mIoU· 2023-11-28
TransNeXt: Robust Foveal Visual Perception for Vision Transformers Code
#69ConvNeXt-XL++
54
Validation mIoU· 2022-01-10
A ConvNet for the 2020s Code
#70Sequential Ensemble (SegFormer)
54
Validation mIoU· 2022-10-08
Sequential Ensembling for Semantic Segmentation
#71MogaNet-XL (UperNet)
54
Validation mIoU· 2022-11-07
MogaNet: Multi-order Gated Aggregation Network Code
#72UniRepLKNet-B++
53.9
Validation mIoU· 2023-11-27
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition Code
#73MaskFormer(Swin-B)
53.8
Validation mIoU· 2021-07-13
Per-Pixel Classification is Not All You Need for Semantic Segmentation Code
#74ConvNeXt-L++
53.7
Validation mIoU· 2022-01-10
A ConvNet for the 2020s Code
#75SwinV2-G-HTC++ Liu et al. ([2021a])
53.7
Validation mIoU· 2021-11-18
Swin Transformer V2: Scaling Up Capacity and Resolution Code
#76ConvNeXt V2-L
53.7
Validation mIoU· 2023-01-02
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders Code
#77Seg-L-Mask/16 (MS)
53.63
Validation mIoU· 2021-05-12
Segmenter: Transformer for Semantic Segmentation Code
#78MAE (ViT-L, UperNet)
53.6
Validation mIoU· 2021-11-11
Masked Autoencoders Are Scalable Vision Learners Code
#79SeMask (SeMask Swin-L FPN)
53.52
Validation mIoU· 2021-12-23
SeMask: Semantically Masked Transformers for Semantic Segmentation Code
#80Swin-L (UperNet, ImageNet-22k pretrain) SOTA
53.5
Validation mIoU· 2021-03-25
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Code
#81Swin-L
53.5
Validation mIoU· 2023-01-02
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders Code
#82TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512)
53.4
Validation mIoU· 2023-11-28
TransNeXt: Robust Foveal Visual Perception for Vision Transformers Code
#83ConvNeXt-B++
53.1
Validation mIoU· 2022-01-10
A ConvNet for the 2020s Code
#84PatchConvNet-L120 (UperNet)
52.9
Validation mIoU· 2021-12-27
Augmenting Convolutional networks with attention-based aggregation Code
#85dBOT ViT-B (CLIP)
52.9
Validation mIoU· 2022-09-08
Exploring Target Representations for Masked Autoencoders Code
#86PatchConvNet-B120 (UperNet)
52.8
Validation mIoU· 2021-12-27
Augmenting Convolutional networks with attention-based aggregation Code
#87Swin-B
52.8
Validation mIoU· 2023-01-02
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders Code
#88UniRepLKNet-S++
52.7
Validation mIoU· 2023-11-27
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition Code
#89ConvNeXt V2-B
52.1
Validation mIoU· 2023-01-02
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders Code
#90DeBiFormer-B (IN1k pretrain, Upernet 160k)
52
Validation mIoU· 2024-10-11
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention Code
#91LV-ViT-L (UperNet, MS)
51.8
Validation mIoU· 2021-04-22
All Tokens Matter: Token Labeling for Training Better Vision Transformers Code
#92SegFormer-B5
51.8
Validation mIoU· Extra Data· 2021-05-31
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers Code
#93BiFormer-B (IN1k pretrain, Upernet 160k)
51.7
Validation mIoU· 2023-03-15
BiFormer: Vision Transformer with Bi-Level Routing Attention Code
#94ConvNeXt V2-L (Supervised)
51.6
Validation mIoU· 2023-01-02
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders Code
#95Light-Ham (VAN-Huge)
51.5
Validation mIoU· 2021-09-09
Is Attention Better Than Matrix Decomposition? Code
#96DAT-B++
51.5
Validation mIoU· 2023-09-04
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention Code
#97CrossFormer (ImageNet1k-pretrain, UPerNet, multi-scale test)
51.4
Validation mIoU· 2021-07-31
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention Code
#98InternImage-B
51.3
Validation mIoU· 2022-11-10
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions Code
#99DAT-S++
51.2
Validation mIoU· 2023-09-04
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention Code
#100ActiveMLP-L(UperNet)
51.1
Validation mIoU· 2022-03-11
Active Token Mixer Code
#101SegFormer-B4
51.1
Validation mIoU· Extra Data· 2021-05-31
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers Code
#102PatchConvNet-B60 (UperNet)
51.1
Validation mIoU· 2021-12-27
Augmenting Convolutional networks with attention-based aggregation Code
#103Light-Ham (VAN-Large)
51
Validation mIoU· 2021-09-09
Is Attention Better Than Matrix Decomposition? Code
#104TEC (ViT-B, UperNet)
51
Validation mIoU· 2022-10-20
Towards Sustainable Self-supervised Learning Code
#105UniRepLKNet-S
51
Validation mIoU· 2023-11-27
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition Code
#106SeMask (SeMask Swin-B FPN)
50.98
Validation mIoU· 2021-12-23
SeMask: Semantically Masked Transformers for Semantic Segmentation Code
#107InternImage-S
50.9
Validation mIoU· 2022-11-10
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions Code
#108MogaNet-L (UperNet)
50.9
Validation mIoU· 2022-11-07
MogaNet: Multi-order Gated Aggregation Network Code
#109dBOT ViT-B
50.8
Validation mIoU· 2022-09-08
Exploring Target Representations for Masked Autoencoders Code
#110Upernet-BiFormer-S (IN1k pretrain, Upernet 160k)
50.8
Validation mIoU· 2023-03-15
BiFormer: Vision Transformer with Bi-Level Routing Attention Code
#111UperNet Shuffle-B
50.5
Validation mIoU· 2021-06-07
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer Code
#112ConvNeXt V1-L
50.5
Validation mIoU· 2023-01-02
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders Code
#113DiNAT-Base (UperNet)
50.4
Validation mIoU· 2022-09-29
Dilated Neighborhood Attention Transformer Code
#114ELSA-Swin-S
50.3
Validation mIoU· 2021-12-23
ELSA: Enhanced Local Self-Attention for Vision Transformer Code
#115DAT-T++
50.3
Validation mIoU· 2023-09-04
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention Code
#116SETR-MLA (160k, MS) SOTA
50.28
Validation mIoU· 2020-12-31
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers Code
#117VAN-Large (HamNet)
50.2
Validation mIoU· 2022-02-20
Visual Attention Network Code
#118HRViT-b3 (SegFormer, SS)
50.2
Validation mIoU· 2021-11-01
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation Code
#119Twins-SVT-L (UperNet, ImageNet-1k pretrain)
50.2
Validation mIoU· 2021-04-28
Twins: Revisiting the Design of Spatial Attention in Vision Transformers Code
#120MogaNet-B (UperNet)
50.1
Validation mIoU· 2022-11-07
MogaNet: Multi-order Gated Aggregation Network Code
#121Seg-B-Mask/16(MS, ViT-B)
50
Validation mIoU· 2021-05-12
Segmenter: Transformer for Semantic Segmentation Code
#122iBOT (ViT-B/16)
50
Validation mIoU· 2021-11-15
iBOT: Image BERT Pre-Training with Online Tokenizer Code
#123ConvNeXt-B
49.9
Validation mIoU· 2022-01-10
A ConvNet for the 2020s Code
#124DiNAT-Small (UperNet)
49.9
Validation mIoU· 2022-09-29
Dilated Neighborhood Attention Transformer Code
#125ConvNeXt V1-B
49.9
Validation mIoU· 2023-01-02
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders Code
#126NAT-Base
49.7
Validation mIoU· 2022-04-14
Neighborhood Attention Transformer Code
#127Swin-B (UperNet, ImageNet-1k pretrain)
49.7
Validation mIoU· 2021-03-25
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Code
#128Seg-B/8 (MS, ViT-B)
49.61
Validation mIoU· 2021-05-12
Segmenter: Transformer for Semantic Segmentation Code
#129ConvNeXt-S
49.6
Validation mIoU· 2022-01-10
A ConvNet for the 2020s Code
#130Light-Ham (VAN-Base)
49.6
Validation mIoU· 2021-09-09
Is Attention Better Than Matrix Decomposition? Code
#131NAT-Small
49.5
Validation mIoU· 2022-04-14
Neighborhood Attention Transformer Code
#132DaViT-B
49.4
Validation mIoU· 2022-04-07
DaViT: Dual Attention Vision Transformers Code
#133DAT-B (UperNet)
49.38
Validation mIoU· 2022-01-03
Vision Transformer with Deformable Attention Code
#134PatchConvNet-S60 (UperNet)
49.3
Validation mIoU· 2021-12-27
Augmenting Convolutional networks with attention-based aggregation Code
#135ColorMAE-Green-ViTB-1600
49.3
Validation mIoU· 2024-07-17
ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders Code
#136MogaNet-S (UperNet)
49.2
Validation mIoU· 2022-11-07
MogaNet: Multi-order Gated Aggregation Network Code
#137Shift-B (UperNet)
49.2
Validation mIoU· 2022-01-26
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism Code
#138UniRepLKNet-T
49.1
Validation mIoU· 2023-11-27
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition Code
#139DPT-Hybrid
49.02
Validation mIoU· 2021-03-24
Vision Transformers for Dense Prediction Code
#140GC ViT-B
49
Validation mIoU· 2022-06-20
Global Context Vision Transformers Code
#141A2MIM (ViT-B)
49
Validation mIoU· 2022-05-27
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN Code
#142EfficientViT-B3 (r512)
49
Validation mIoU· 2022-05-29
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction Code
#143DiNAT-Tiny (UperNet)
48.8
Validation mIoU· 2022-09-29
Dilated Neighborhood Attention Transformer Code
#144HRViT-b2 (SegFormer, SS)
48.76
Validation mIoU· 2021-11-01
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation Code
#145NAT-Tiny
48.4
Validation mIoU· 2022-04-14
Neighborhood Attention Transformer Code
#146XCiT-M24/8 (UperNet)
48.4
Validation mIoU· 2021-06-17
XCiT: Cross-Covariance Image Transformers Code
#147ResNeSt-200 SOTA
48.36
Validation mIoU· 2020-04-19
ResNeSt: Split-Attention Networks Code
#148DAT-S (UperNet)
48.31
Validation mIoU· 2022-01-03
Vision Transformer with Deformable Attention Code
#149GC ViT-S
48.3
Validation mIoU· 2022-06-20
Global Context Vision Transformers Code
#150InternImage-T
48.1
Validation mIoU· 2022-11-10
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions Code
#151VAN-Large
48.1
Validation mIoU· 2022-02-20
Visual Attention Network Code
#152XCiT-S24/8 (UperNet)
48.1
Validation mIoU· 2021-06-17
XCiT: Cross-Covariance Image Transformers Code
#153MaskFormer(ResNet-101)
48.1
Validation mIoU· 2021-07-13
Per-Pixel Classification is Not All You Need for Semantic Segmentation Code
#154MAE (ViT-B, UperNet)
48.1
Validation mIoU· 2021-11-11
Masked Autoencoders Are Scalable Vision Learners Code
#155HRNetV2 + OCR + RMI (PaddleClas pretrained) SOTA
47.98
Validation mIoU· 2019-09-24
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation Code
#156Shift-B
47.9
Validation mIoU· 2022-01-26
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism Code
#157Shift-S
47.8
Validation mIoU· 2022-01-26
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism Code
#158MogaNet-S (Semantic FPN)
47.7
Validation mIoU· 2022-11-07
MogaNet: Multi-order Gated Aggregation Network Code
#159SeMask (SeMask Swin-S FPN)
47.63
Validation mIoU· 2021-12-23
SeMask: Semantically Masked Transformers for Semantic Segmentation Code
#160ResNeSt-269
47.6
Validation mIoU· 2020-04-19
ResNeSt: Split-Attention Networks Code
#161UperNet Shuffle-T
47.6
Validation mIoU· 2021-06-07
Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer Code
#162CondNet(ResNeSt-101)
47.54
Validation mIoU· 2021-09-21
CondNet: Conditional Classifier for Scene Segmentation Code
#163tiny-MOAT-3 (IN-1K pretraining, single scale)
47.5
Validation mIoU· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#164CondNet(ResNet-101)
47.38
Validation mIoU· 2021-09-21
CondNet: Conditional Classifier for Scene Segmentation Code
#165DiNAT-Mini (UperNet)
47.2
Validation mIoU· 2022-09-29
Dilated Neighborhood Attention Transformer Code
#166DCNAS
47.12
Validation mIoU· 2020-03-26
DCNAS: Densely Connected Neural Architecture Search for Semantic Image Segmentation
#167XCiT-S24/8 (Semantic-FPN)
47.1
Validation mIoU· 2021-06-17
XCiT: Cross-Covariance Image Transformers Code
#168ResNeSt-101
46.91
Validation mIoU· 2020-04-19
ResNeSt: Split-Attention Networks Code
#169XCiT-M24/8 (Semantic-FPN)
46.9
Validation mIoU· 2021-06-17
XCiT: Cross-Covariance Image Transformers Code
#170HamNet (ResNet-101)
46.8
Validation mIoU· 2021-09-09
Is Attention Better Than Matrix Decomposition? Code
#171Sequential Ensemble (DeepLabv3+)
46.8
Validation mIoU· 2022-10-08
Sequential Ensembling for Semantic Segmentation
#172ConvNeXt-T
46.7
Validation mIoU· 2022-01-10
A ConvNet for the 2020s Code
#173VAN-Base (Semantic-FPN)
46.7
Validation mIoU· 2022-02-20
Visual Attention Network Code
#174XCiT-S12/8 (UperNet)
46.6
Validation mIoU· 2021-06-17
XCiT: Cross-Covariance Image Transformers Code
#175GC ViT-T
46.5
Validation mIoU· 2022-06-20
Global Context Vision Transformers Code
#176NAT-Mini
46.4
Validation mIoU· 2022-04-14
Neighborhood Attention Transformer Code
#177Shift-T
46.3
Validation mIoU· 2022-01-26
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism Code
#178DaViT-T
46.3
Validation mIoU· 2022-04-07
DaViT: Dual Attention Vision Transformers Code
#179CPN(ResNet-101)
46.27
Validation mIoU· 2020-04-03
Context Prior for Scene Segmentation Code
#180MultiMAE (ViT-B)
46.2
Validation mIoU· 2022-04-04
MultiMAE: Multi-modal Multi-task Masked Autoencoders Code
#181DRAN(ResNet-101)
46.18
Validation mIoU
No paper Code
#182PyConvSegNet-152
45.99
Validation mIoU· 2020-06-20
Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition Code
#183DNL
45.97
Validation mIoU· 2020-06-11
Disentangled Non-Local Neural Networks Code
#184ACNet (ResNet-101)
45.9
Validation mIoU· 2019-11-05
Adaptive Context Network for Scene Parsing
#185ACNet (ResNet-101)
45.9
Validation mIoU· 2019-11-05
Adaptive Context Network for Scene Parsing
#186HRViT-b1 (SegFormer, SS)
45.88
Validation mIoU· 2021-11-01
Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation Code
#187OCR(HRNetV2-W48)
45.66
Validation mIoU· 2019-09-24
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation Code
#188SPNet (ResNet-101)
45.6
Validation mIoU· 2020-03-30
Strip Pooling: Rethinking Spatial Pooling for Scene Parsing Code
#189Swin-T (UPerNet) MoBY
45.58
Validation mIoU· 2021-05-10
Self-Supervised Learning with Swin Transformers Code
#190DAT-T (UperNet)
45.54
Validation mIoU· 2022-01-03
Vision Transformer with Deformable Attention Code
#191iBOT (ViT-S/16)
45.4
Validation mIoU· 2021-11-15
iBOT: Image BERT Pre-Training with Online Tokenizer Code
#192EANet (ResNet-101)
45.33
Validation mIoU· 2021-05-05
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks Code
#193OCR (ResNet-101)
45.28
Validation mIoU· 2019-09-24
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation Code
#194Asymmetric ALNN SOTA
45.24
Validation mIoU· 2019-08-21
Asymmetric Non-local Neural Networks for Semantic Segmentation Code
#195Light-Ham (VAN-Small, D=256)
45.2
Validation mIoU· 2021-09-09
Is Attention Better Than Matrix Decomposition? Code
#196LaU-regression-loss
45.02
Validation mIoU· 2019-11-13
Location-aware Upsampling for Semantic Segmentation Code
#197PSPNet SOTA
44.94
Validation mIoU· 2016-12-04
Pyramid Scene Parsing Network Code
#198tiny-MOAT-2 (IN-1K pretraining, single scale)
44.9
Validation mIoU· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#199CFNet(ResNet-101)
44.89
Validation mIoU
No paper Code
#200EncNet
44.65
Validation mIoU· 2018-03-23
Context Encoding for Semantic Segmentation Code
#201LaU-offset-loss
44.55
Validation mIoU· 2019-11-13
Location-aware Upsampling for Semantic Segmentation Code
#202EncNet + JPU
44.34
Validation mIoU· 2019-03-28
FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation Code
#203SGR (ResNet-101)
44.32
Validation mIoU
No paper Code
#204XCiT-S12/8 (Semantic-FPN)
44.2
Validation mIoU· 2021-06-17
XCiT: Cross-Covariance Image Transformers Code
#205Auto-DeepLab-L
43.98
Validation mIoU· 2019-01-10
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation Code
#206PSANet (ResNet-101)
43.77
Validation mIoU
No paper Code
#207DSSPN (ResNet-101)
43.68
Validation mIoU· 2018-03-16
Dynamic-structured Semantic Propagation Network
#208PSPNet (ResNet-152)
43.51
Validation mIoU· 2016-12-04
Pyramid Scene Parsing Network Code
#209PSPNet (ResNet-101)
43.29
Validation mIoU· 2016-12-04
Pyramid Scene Parsing Network Code
#210HRNetV2
43.2
Validation mIoU· 2019-04-09
High-Resolution Representations for Labeling Pixels and Regions Code
#211SeMask (SeMask Swin-T FPN)
43.16
Validation mIoU· 2021-12-23
SeMask: Semantically Masked Transformers for Semantic Segmentation Code
#212tiny-MOAT-1 (IN-1K pretraining, single scale)
43.1
Validation mIoU· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#213VAN-Small
42.9
Validation mIoU· 2022-02-20
Visual Attention Network Code
#214PoolFormer-M48
42.7
Validation mIoU· 2021-11-22
MetaFormer Is Actually What You Need for Vision Code
#215UperNet (ResNet-101)
42.66
Validation mIoU· 2018-07-26
Unified Perceptual Parsing for Scene Understanding Code
#216tiny-MOAT-0 (IN-1K pretraining, single scale)
41.2
Validation mIoU· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#217RefineNet SOTA
40.7
Validation mIoU· 2016-11-20
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation Code
#218FBNetV5
40.4
Validation mIoU· 2021-11-19
FBNetV5: Neural Architecture Search for Multiple Tasks in One Run
#219ConvMLP-L
40
Validation mIoU· 2021-09-09
ConvMLP: Hierarchical Convolutional MLPs for Vision Code
#220ConvMLP-M
38.6
Validation mIoU· 2021-09-09
ConvMLP: Hierarchical Convolutional MLPs for Vision Code
#221VAN-Tiny
38.5
Validation mIoU· 2022-02-20
Visual Attention Network Code
#222A2MIM (ResNet-50)
38.3
Validation mIoU· 2022-05-27
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN Code
#223iBOT (ViT-B/16) (linear head)
38.3
Validation mIoU· 2021-11-15
iBOT: Image BERT Pre-Training with Online Tokenizer Code
#224SegFormer-B0
37.4
Validation mIoU· Extra Data· 2021-05-31
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers Code
#225MUXNet-m + PPM
35.8
Validation mIoU· 2020-03-31
MUXConv: Information Multiplexing in Convolutional Neural Networks Code
#226ConvMLP-S
35.8
Validation mIoU· 2021-09-09
ConvMLP: Hierarchical Convolutional MLPs for Vision Code
#227MUXNet-m + C1
32.42
Validation mIoU· 2020-03-31
MUXConv: Information Multiplexing in Convolutional Neural Networks Code
#228DilatedNet SOTA
32.31
Validation mIoU· 2015-11-23
Multi-Scale Context Aggregation by Dilated Convolutions Code
#229FCN SOTA
29.39
Validation mIoU· Extra Data· 2014-11-14
Fully Convolutional Networks for Semantic Segmentation Code
#230SegNet
21.64
Validation mIoU· 2015-11-02
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation Code