Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Semantic segmentation on ADE20K val

Metric: mIoU (higher is better)
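For reference, mean Intersection-over-Union (mIoU) averages the per-class IoU, IoU_c = TP_c / (TP_c + FP_c + FN_c), over the classes present in the ground truth or prediction. A minimal sketch (the tiny example masks and class IDs below are illustrative, not ADE20K's evaluation code):

```python
def mean_iou(pred, gt, num_classes):
    """Compute mIoU between two flat lists of per-pixel class labels."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        fp = sum(1 for p, g in zip(pred, gt) if p == c and g != c)
        fn = sum(1 for p, g in zip(pred, gt) if p != c and g == c)
        denom = tp + fp + fn
        if denom == 0:  # class absent in both masks: skip, don't count as 0
            continue
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0

pred = [0, 0, 1, 1, 2, 2]
gt   = [0, 0, 1, 2, 2, 2]
print(round(mean_iou(pred, gt, num_classes=3), 4))  # → 0.7222
```

Because absent classes are skipped rather than scored as zero, every class present in an image contributes equally to the mean regardless of its pixel count; full evaluation harnesses accumulate TP/FP/FN over the whole validation set before dividing.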


Results

| # | Model | mIoU | Extra Data | Paper | Date | Code |
|---|-------|------|------------|-------|------|------|
| 1 | BEiT-3 | 62.8 | Yes | Image as a Foreign Language: BEiT Pretraining fo... | 2022-08-22 | Code |
| 2 | ViT-CoMer | 62.1 | No | - | - | Code |
| 3 | EVA | 61.5 | No | EVA: Exploring the Limits of Masked Visual Repre... | 2022-11-14 | Code |
| 4 | FD-SwinV2-G | 61.4 | Yes | Contrastive Learning Rivals Masked Image Modelin... | 2022-05-27 | Code |
| 5 | MaskDINO-SwinL | 60.8 | Yes | Mask DINO: Towards A Unified Transformer-based F... | 2022-06-06 | Code |
| 6 | OneFormer (InternImage-H, emb_dim=256, multi-scale, 896x896) | 60.8 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 7 | ViT-Adapter-L (Mask2Former, BEiT pretrain) | 60.5 | Yes | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 8 | OneFormer (InternImage-H, emb_dim=256, single-scale, 896x896) | 60.4 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 9 | SERNet-Former_v2 | 59.35 | Yes | SERNet-Former: Semantic Segmentation by Efficien... | 2024-01-28 | Code |
| 10 | X-Decoder (Davit-d5, Deform, single-scale, 1280x1280) | 59.1 | Yes | Generalized Decoding for Pixel, Image, and Langu... | 2022-12-21 | Code |
| 11 | OneFormer (DiNAT-L, single-scale, 1280x1280, COCO-Pretrain) | 58.9 | Yes | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 12 | OneFormer (DiNAT-L, multi-scale, 896x896) | 58.6 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 13 | ViT-Adapter-L (UperNet, BEiT pretrain) | 58.4 | Yes | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 14 | OneFormer (DiNAT-L, multi-scale, 640x640) | 58.4 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 15 | RSSeg-ViT-L (BEiT pretrain) | 58.4 | No | Representation Separation for Semantic Segmentat... | 2022-12-28 | - |
| 16 | EoMT (DINOv2-L, single-scale, 512x512) | 58.4 | No | Your ViT is Secretly an Image Segmentation Model | 2025-03-24 | Code |
| 17 | OneFormer (Swin-L, multi-scale, 896x896) | 58.3 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 18 | OneFormer (DiNAT-L, single-scale, 1280x1280) | 58.3 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 19 | OneFormer (DiNAT-L, single-scale, 640x640) | 58.3 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 20 | SeMask (SeMask Swin-L FaPN-Mask2Former) | 58.2 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 21 | SeMask (SeMask Swin-L MSFaPN-Mask2Former) | 58.2 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 22 | DiNAT-L (Mask2Former) | 58.1 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 23 | X-Decoder (L) | 58.1 | Yes | Generalized Decoding for Pixel, Image, and Langu... | 2022-12-21 | Code |
| 24 | Mask2Former (Swin-L-FaPN, multiscale) | 57.7 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 25 | OneFormer (Swin-L, multi-scale, 640x640) | 57.7 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 26 | SeMask (SeMask Swin-L Mask2Former) | 57.5 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 27 | OneFormer (ConvNeXt-XL, single-scale, 640x640) | 57.4 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 28 | SenFormer (BEiT-L) | 57.1 | No | Efficient Self-Ensemble for Semantic Segmentation | 2021-11-26 | Code |
| 29 | BEiT-L (ViT+UperNet, ImageNet-22k pretrain) | 57.0 | No | BEiT: BERT Pre-Training of Image Transformers | 2021-06-15 | Code |
| 30 | SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale) | 57.0 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 31 | OneFormer (Swin-L, single-scale, 1280x1280) | 57.0 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 32 | OneFormer (Swin-L, single-scale, 640x640) | 57.0 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 33 | FaPN (MaskFormer, Swin-L, ImageNet-22k pretrain) | 56.7 | No | FaPN: Feature-aligned Pyramid Network for Dense ... | 2021-08-16 | Code |
| 34 | OneFormer (ConvNeXt-L, single-scale, 640x640) | 56.6 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 35 | Mask2Former (Swin-L-FaPN) | 56.4 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 36 | DiNAT-L (Mask2Former, 640x640) | 56.3 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 37 | SeMask (SeMask Swin-L MaskFormer) | 56.2 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 38 | CSWin-L (UperNet, ImageNet-22k pretrain) | 55.7 | No | CSWin Transformer: A General Vision Transformer ... | 2021-07-01 | Code |
| 39 | MaskFormer (Swin-L, ImageNet-22k pretrain) | 55.6 | No | Per-Pixel Classification is Not All You Need for... | 2021-07-13 | Code |
| 40 | DeiT-L | 55.6 | No | DeiT III: Revenge of the ViT | 2022-04-14 | Code |
| 41 | Focal-L (UperNet, ImageNet-22k pretrain) | 55.4 | No | Focal Self-attention for Local-Global Interactio... | 2021-07-01 | Code |
| 42 | Mask2Former (Swin-L + FAPN, 640x640) | 55.4 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 43 | SegViT ViT-Large | 55.2 | No | SegViT: Semantic Segmentation with Plain Vision ... | 2022-10-12 | Code |
| 44 | kMaX-DeepLab (ConvNeXt-L, single-scale, 1281x1281) | 55.2 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 45 | kMaX-DeepLab (ConvNeXt-L, single-scale, 641x641) | 54.8 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 46 | Mask2Former (Swin-L) | 54.5 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 47 | K-Net | 54.3 | No | K-Net: Towards Unified Image Segmentation | 2021-06-28 | Code |
| 48 | DEPICT-SA (ViT-L 640x640 multi-scale) | 54.3 | No | Rethinking Decoders for Transformer-based Semant... | 2024-11-05 | Code |
| 49 | SenFormer (Swin-L) | 54.2 | No | Efficient Self-Ensemble for Semantic Segmentation | 2021-11-26 | Code |
| 50 | DeiT-B | 54.1 | No | DeiT III: Revenge of the ViT | 2022-04-14 | Code |
| 51 | MixMIM-L | 53.8 | No | MixMAE: Mixed and Masked Autoencoder for Efficie... | 2022-05-26 | Code |
| 52 | Seg-L-Mask/16 (MS, ViT-L) | 53.63 | No | Segmenter: Transformer for Semantic Segmentation | 2021-05-12 | Code |
| 53 | Swin-L (UperNet, ImageNet-22k pretrain) | 53.5 | No | Swin Transformer: Hierarchical Vision Transforme... | 2021-03-25 | Code |
| 54 | SeMask (SeMask Swin-L FPN) | 53.5 | No | SeMask: Semantically Masked Transformers for Sem... | 2021-12-23 | Code |
| 55 | PatchConvNet-L120 (UperNet) | 52.9 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 56 | DEPICT-SA (ViT-L 640x640 single-scale) | 52.9 | No | Rethinking Decoders for Transformer-based Semant... | 2024-11-05 | Code |
| 57 | PatchConvNet-B120 (UperNet) | 52.8 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 58 | SegFormer-B5 (MS, 87M #Params, ImageNet-1K pretrain) | 51.8 | No | SegFormer: Simple and Efficient Design for Seman... | 2021-05-31 | Code |
| 59 | Light-Ham (VAN-Huge, 61M, IN-1k, MS) | 51.5 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 60 | PatchConvNet-B60 (UperNet) | 51.1 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 61 | Light-Ham (VAN-Large, 46M, IN-1k, MS) | 51.0 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 62 | UperNet Shuffle-B | 50.5 | No | Shuffle Transformer: Rethinking Spatial Shuffle ... | 2021-06-07 | Code |
| 63 | ELSA-Swin-S | 50.3 | No | ELSA: Enhanced Local Self-Attention for Vision T... | 2021-12-23 | Code |
| 64 | MixMIM-B | 50.3 | No | MixMAE: Mixed and Masked Autoencoder for Efficie... | 2022-05-26 | Code |
| 65 | Twins-SVT-L (UperNet, ImageNet-1k pretrain) | 50.2 | No | Twins: Revisiting the Design of Spatial Attentio... | 2021-04-28 | Code |
| 66 | Seg-B-Mask/16 (MS, ViT-B) | 50.0 | No | Segmenter: Transformer for Semantic Segmentation | 2021-05-12 | Code |
| 67 | Panoptic-DeepLab (SwideRNet) | 50.0 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 68 | Swin-B (UperNet, ImageNet-1k pretrain) | 49.7 | No | Swin Transformer: Hierarchical Vision Transforme... | 2021-03-25 | Code |
| 69 | gSwin-S | 49.69 | No | gSwin: Gated MLP Vision Model with Hierarchical ... | 2022-08-24 | - |
| 70 | Seg-B/8 (MS, ViT-B) | 49.61 | No | Segmenter: Transformer for Semantic Segmentation | 2021-05-12 | Code |
| 71 | UperNet Shuffle-S | 49.6 | No | Shuffle Transformer: Rethinking Spatial Shuffle ... | 2021-06-07 | Code |
| 72 | Light-Ham (VAN-Base, 27M, IN-1k, MS) | 49.6 | No | Is Attention Better Than Matrix Decomposition? | 2021-09-09 | Code |
| 73 | PatchConvNet-S60 (UperNet) | 49.3 | No | Augmenting Convolutional networks with attention... | 2021-12-27 | Code |
| 74 | DPT-Hybrid | 49.02 | No | Vision Transformers for Dense Prediction | 2021-03-24 | Code |
| 75 | DaViT-S (UperNet) | 48.8 | No | DaViT: Dual Attention Vision Transformers | 2022-04-07 | Code |
| 76 | ResNeSt-200 | 48.36 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 77 | HRNetV2 + OCR + RMI (PaddleClas pretrained) | 47.98 | No | Segmentation Transformer: Object-Contextual Repr... | 2019-09-24 | Code |
| 78 | gSwin-T | 47.63 | No | gSwin: Gated MLP Vision Model with Hierarchical ... | 2022-08-24 | - |
| 79 | ResNeSt-269 | 47.6 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 80 | UperNet Shuffle-T | 47.6 | No | Shuffle Transformer: Rethinking Spatial Shuffle ... | 2021-06-07 | Code |
| 81 | DCNAS | 47.12 | No | DCNAS: Densely Connected Neural Architecture Sea... | 2020-03-26 | - |
| 82 | ResNeSt-101 | 46.91 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 83 | Seg-S-Mask/16 (MS, ViT-S) | 46.9 | No | - | - | - |
| 84 | Swin-S (RPE w/ GAB) | 46.41 | No | Understanding Gaussian Attention Bias of Vision ... | 2023-05-08 | Code |
| 85 | DaViT-B (UperNet) | 46.3 | No | DaViT: Dual Attention Vision Transformers | 2022-04-07 | Code |
| 86 | CPN (ResNet-101) | 46.27 | No | Context Prior for Scene Segmentation | 2020-04-03 | Code |
| 87 | MultiMAE (ViT-B) | 46.2 | No | MultiMAE: Multi-modal Multi-task Masked Autoenco... | 2022-04-04 | Code |
| 88 | Mask2Former (ResNet-50, 640x640) | 46.1 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 89 | PyConvSegNet-152 | 45.99 | No | Pyramidal Convolution: Rethinking Convolutional ... | 2020-06-20 | Code |
| 90 | DNL | 45.97 | No | Disentangled Non-Local Neural Networks | 2020-06-11 | Code |
| 91 | CTNet | 45.94 | No | CTNet: Context-based Tandem Network for Semantic... | 2021-04-20 | Code |
| 92 | ACNet (ResNet-101) | 45.9 | No | Adaptive Context Network for Scene Parsing | 2019-11-05 | - |
| 93 | ACNet (ResNet-101) | 45.9 | No | Adaptive Context Network for Scene Parsing | 2019-11-05 | - |
| 94 | OCR (HRNetV2-W48) | 45.66 | No | Segmentation Transformer: Object-Contextual Repr... | 2019-09-24 | Code |
| 95 | EANet (ResNet-101) | 45.33 | No | Beyond Self-attention: External Attention using ... | 2021-05-05 | Code |
| 96 | kMaX-DeepLab (ResNet50, single-scale, 1281x1281) | 45.3 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 97 | OCR (ResNet-101) | 45.28 | No | Segmentation Transformer: Object-Contextual Repr... | 2019-09-24 | Code |
| 98 | Asymmetric ALNN | 45.24 | No | Asymmetric Non-local Neural Networks for Semanti... | 2019-08-21 | Code |
| 99 | gSwin-VT | 45.07 | No | gSwin: Gated MLP Vision Model with Hierarchical ... | 2022-08-24 | - |
| 100 | LaU-regression-loss | 45.02 | No | Location-aware Upsampling for Semantic Segmentat... | 2019-11-13 | Code |
| 101 | kMaX-DeepLab (ResNet50, single-scale, 641x641) | 45.0 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 102 | EncNet (ResNet-101) | 44.65 | No | Context Encoding for Semantic Segmentation | 2018-03-23 | Code |
| 103 | SGR (ResNet-101) | 44.32 | No | - | - | Code |
| 104 | Auto-DeepLab-L | 43.98 | No | Auto-DeepLab: Hierarchical Neural Architecture S... | 2019-01-10 | Code |
| 105 | PSANet (ResNet-101) | 43.77 | No | - | - | Code |
| 106 | DSSPN (ResNet-101) | 43.68 | No | Dynamic-structured Semantic Propagation Network | 2018-03-16 | - |
| 107 | HRNetV2 (HRNetV2-W48) | 42.99 | No | High-Resolution Representations for Labeling Pix... | 2019-04-09 | Code |
| 108 | UperNet (ResNet-101) | 42.66 | No | Unified Perceptual Parsing for Scene Understanding | 2018-07-26 | Code |
| 109 | RefineNet (ResNet-152) | 40.7 | No | RefineNet: Multi-Path Refinement Networks for Hi... | 2016-11-20 | Code |
| 110 | RefineNet (ResNet-101) | 40.2 | No | RefineNet: Multi-Path Refinement Networks for Hi... | 2016-11-20 | Code |
| 111 | DHR (Swin-L, Mask2Former) | 32.9 | No | DHR: Dual Features-Driven Hierarchical Rebalanci... | 2024-03-30 | Code |