10-shot image generation on ADE20K
Metric: Params (M) (higher is better)
Results
| # | Model | SOTA | Params (M) | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|---|
| 1 | FD-SwinV2-G | SOTA | 3000 | No | Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation | 2022-05-27 | Code |
| 2 | RevCol-H (Mask2Former) | | 2439 | Yes | Reversible Column Networks | 2022-12-22 | Code |
| 3 | BEiT-3 | | 1900 | Yes | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | 2022-08-22 | Code |
| 4 | ViT-P (InternImage-H) | | 1610 | Yes | The Missing Point in Vision Transformers for Universal Image Segmentation | 2025-05-26 | Code |
| 5 | ONE-PEACE | | 1500 | Yes | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | 2023-05-18 | Code |
| 6 | ViT-P (OneFormer, InternImage-H) | | 1400 | No | The Missing Point in Vision Transformers for Universal Image Segmentation | 2025-05-26 | Code |
| 7 | InternImage-H | | 1310 | Yes | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | 2022-11-10 | Code |
| 8 | M3I Pre-training (InternImage-H) | | 1310 | Yes | Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | 2022-11-17 | Code |
| 9 | InternImage-H (M3I Pre-training) | | 1310 | No | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | 2022-11-10 | Code |
| 10 | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) | | 1080 | No | DINOv2: Learning Robust Visual Features without Supervision | 2023-04-14 | Code |
| 11 | EVA | | 1074 | Yes | EVA: Exploring the Limits of Masked Visual Representation Learning at Scale | 2022-11-14 | Code |
| 12 | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) | SOTA | 571 | Yes | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 13 | ViT-Adapter-L (Mask2Former, BEiT pretrain) | | 571 | Yes | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 14 | MOAT-4 (IN-22K pretraining, single-scale) | | 496 | No | MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | 2022-10-04 | Code |
| 15 | ViT-Adapter-L (UperNet, BEiT pretrain) | | 451 | No | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 16 | ConvNeXt-XL++ | SOTA | 391 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 17 | InternImage-XL | | 368 | No | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | 2022-11-10 | Code |
| 18 | RSSeg-ViT-L (BEiT pretrain) | | 330 | No | Representation Separation for Semantic Segmentation with Vision Transformers | 2022-12-28 | - |
| 19 | EoMT (DINOv2-L, single-scale, 512x512) | | 316 | No | Your ViT is Secretly an Image Segmentation Model | 2025-03-24 | Code |
| 20 | ViT-P (OneFormer, DiNAT-L) | | 309 | No | The Missing Point in Vision Transformers for Universal Image Segmentation | 2025-05-26 | Code |
| 21 | InternImage-L | | 256 | No | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | 2022-11-10 | Code |
| 22 | ConvNeXt-L++ | | 235 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 23 | MasK DINO (SwinL, multi-scale) | | 223 | Yes | Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | 2022-06-06 | Code |
| 24 | Sequential Ensemble (SegFormer) | | 216.3 | No | Sequential Ensembling for Semantic Segmentation | 2022-10-08 | - |
| 25 | LV-ViT-L (UperNet, MS) | SOTA | 209 | No | All Tokens Matter: Token Labeling for Training Better Vision Transformers | 2021-04-22 | Code |
| 26 | DDP (Swin-L, step-3) | | 207 | No | DDP: Diffusion Model for Dense Visual Prediction | 2023-03-30 | Code |
| 27 | MOAT-3 (IN-22K pretraining, single-scale) | | 198 | No | MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | 2022-10-04 | Code |
| 28 | InternImage-B | | 128 | No | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | 2022-11-10 | Code |
| 29 | GC ViT-B | | 125 | No | Global Context Vision Transformers | 2022-06-20 | Code |
| 30 | NAT-Base | | 123 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 31 | ConvNeXt-B++ | | 122 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 32 | ConvNeXt-B | | 122 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 33 | DAT-B (UperNet) | | 121 | No | Vision Transformer with Deformable Attention | 2022-01-03 | Code |
| 34 | TransNeXt-Base (IN-1K pretrain, Mask2Former, 512) | | 109 | No | TransNeXt: Robust Foveal Visual Perception for Vision Transformers | 2023-11-28 | Code |
| 35 | ActiveMLP-L(UperNet) | | 108 | No | Active Token Mixer | 2022-03-11 | Code |
| 36 | SeMask (SeMask Swin-B FPN) | | 96 | No | SeMask: Semantically Masked Transformers for Semantic Segmentation | 2021-12-23 | Code |
| 37 | SegFormer-B5 | | 84.7 | Yes | SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | 2021-05-31 | Code |
| 38 | GC ViT-S | | 84 | No | Global Context Vision Transformers | 2022-06-20 | Code |
| 39 | ConvNeXt-S | | 82 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 40 | NAT-Small | | 82 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 41 | MOAT-2 (IN-22K pretraining, single-scale) | | 81 | No | MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | 2022-10-04 | Code |
| 42 | DAT-S (UperNet) | | 81 | No | Vision Transformer with Deformable Attention | 2022-01-03 | Code |
| 43 | InternImage-S | | 80 | No | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | 2022-11-10 | Code |
| 44 | TransNeXt-Small (IN-1K pretrain, Mask2Former, 512) | | 69 | No | TransNeXt: Robust Foveal Visual Perception for Vision Transformers | 2023-11-28 | Code |
| 45 | SegFormer-B4 | | 64.1 | Yes | SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | 2021-05-31 | Code |
| 46 | Light-Ham (VAN-Huge) | | 61.1 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 47 | ConvNeXt-T | | 60 | No | A ConvNet for the 2020s | 2022-01-10 | Code |
| 48 | DAT-T (UperNet) | | 60 | No | Vision Transformer with Deformable Attention | 2022-01-03 | Code |
| 49 | InternImage-T | | 59 | No | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | 2022-11-10 | Code |
| 50 | NAT-Tiny | | 58 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 51 | GC ViT-T | | 58 | No | Global Context Vision Transformers | 2022-06-20 | Code |
| 52 | SeMask (SeMask Swin-S FPN) | | 56 | No | SeMask: Semantically Masked Transformers for Semantic Segmentation | 2021-12-23 | Code |
| 53 | VAN-Large (HamNet) | | 55 | No | Visual Attention Network | 2022-02-20 | Code |
| 54 | NAT-Mini | | 50 | No | Neighborhood Attention Transformer | 2022-04-14 | Code |
| 55 | VAN-Large | | 49 | No | Visual Attention Network | 2022-02-20 | Code |
| 56 | TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512) | | 47.5 | No | TransNeXt: Robust Foveal Visual Perception for Vision Transformers | 2023-11-28 | Code |
| 57 | Light-Ham (VAN-Large) | | 45.6 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 58 | SeMask (SeMask Swin-T FPN) | | 35 | No | SeMask: Semantically Masked Transformers for Semantic Segmentation | 2021-12-23 | Code |
| 59 | HRViT-b3 (SegFormer, SS) | | 28.7 | No | Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | 2021-11-01 | Code |
| 60 | Light-Ham (VAN-Base) | | 27.4 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 61 | tiny-MOAT-3 (IN-1K pretraining, single scale) | | 24 | No | MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | 2022-10-04 | Code |
| 62 | HRViT-b2 (SegFormer, SS) | | 20.8 | No | Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | 2021-11-01 | Code |
| 63 | VAN-Small | | 18 | No | Visual Attention Network | 2022-02-20 | Code |
| 64 | Light-Ham (VAN-Small, D=256) | | 13.8 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 65 | tiny-MOAT-2 (IN-1K pretraining, single scale) | | 13 | No | MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | 2022-10-04 | Code |
| 66 | HRViT-b1 (SegFormer, SS) | | 8.2 | No | Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation | 2021-11-01 | Code |
| 67 | tiny-MOAT-1 (IN-1K pretraining, single scale) | | 8 | No | MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | 2022-10-04 | Code |
| 68 | VAN-Tiny | | 8 | No | Visual Attention Network | 2022-02-20 | Code |
| 69 | tiny-MOAT-0 (IN-1K pretraining, single scale) | | 6 | No | MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models | 2022-10-04 | Code |
| 70 | SegFormer-B0 | | 3.8 | Yes | SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | 2021-05-31 | Code |