Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Panoptic Segmentation on ADE20K val

Metric: Panoptic Quality (PQ), higher is better
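For reference, PQ scores a set of predicted segments against ground truth by matching segments at IoU > 0.5, then dividing the summed IoU of matched pairs by TP + ½FP + ½FN. A minimal illustrative sketch (the function name and inputs are our own, not from any evaluation library):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Compute PQ from already-matched segment pairs.

    matched_ious: IoU values of matched (prediction, ground-truth)
                  segment pairs with IoU > 0.5 (the true positives).
    num_fp:       count of unmatched predicted segments (false positives).
    num_fn:       count of unmatched ground-truth segments (false negatives).
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0
    # PQ = (sum of IoUs over matches) / (TP + 0.5*FP + 0.5*FN)
    return sum(matched_ious) / denom

# Example: 3 matched segments, 1 spurious prediction, 1 missed segment
pq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
# (0.9 + 0.8 + 0.7) / (3 + 0.5 + 0.5) = 2.4 / 4.0 = 0.6
```

Official benchmark numbers are computed over all images and classes with the COCO panoptic evaluation tooling; this sketch only shows the per-set formula.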


Results

| # | Model | PQ | Extra Data | Paper | Date | Code |
|---|-------|----|------------|-------|------|------|
| 1 | OneFormer (InternImage-H, emb_dim=256, single-scale, 896x896) | 54.5 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 2 | ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280, COCO_pretrain) | 54.0 | Yes | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 3 | OpenSeed (SwinL, single-scale, 1280x1280) | 53.7 | Yes | A Simple Framework for Open-Vocabulary Segmentat... | 2023-03-14 | Code |
| 4 | OneFormer (DiNAT-L, single-scale, 1280x1280, COCO-Pretrain) | 53.4 | Yes | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 5 | EoMT (DINOv2-g, single-scale, 1280x1280, COCO pre-trained) | 52.8 | Yes | Your ViT is Secretly an Image Segmentation Model | 2025-03-24 | Code |
| 6 | X-Decoder (Davit-d5, Deform, single-scale, 1280x1280) | 52.4 | Yes | Generalized Decoding for Pixel, Image, and Langu... | 2022-12-21 | Code |
| 7 | ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280) | 51.9 | No | The Missing Point in Vision Transformers for Uni... | 2025-05-26 | Code |
| 8 | OneFormer (DiNAT-L, single-scale, 1280x1280) | 51.5 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 9 | OneFormer (Swin-L, single-scale, 1280x1280) | 51.4 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 10 | kMaX-DeepLab (ConvNeXt-L, single-scale, 1281x1281) | 50.9 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 11 | OneFormer (DiNAT-L, single-scale, 640x640) | 50.5 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 12 | OneFormer (ConvNeXt-XL, single-scale, 640x640) | 50.1 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 13 | OneFormer (ConvNeXt-L, single-scale, 640x640) | 50.0 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 14 | OneFormer (Swin-L, single-scale, 640x640) | 49.8 | No | OneFormer: One Transformer to Rule Universal Ima... | 2022-11-10 | Code |
| 15 | X-Decoder (L) | 49.6 | Yes | Generalized Decoding for Pixel, Image, and Langu... | 2022-12-21 | Code |
| 16 | DiNAT-L (Mask2Former, 640x640) | 49.4 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 17 | kMaX-DeepLab (ConvNeXt-L, single-scale, 641x641) | 48.7 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 18 | Mask2Former (Swin-L) | 48.1 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 19 | Mask2Former (Swin-L + FAPN, 640x640) | 46.2 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 20 | kMaX-DeepLab (ResNet50, single-scale, 1281x1281) | 42.3 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 21 | kMaX-DeepLab (ResNet50, single-scale, 641x641) | 41.5 | No | kMaX-DeepLab: k-means Mask Transformer | 2022-07-08 | Code |
| 22 | Mask2Former (ResNet-50, 640x640) | 39.7 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 23 | Panoptic-DeepLab (SwideRNet) | 37.9 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 24 | MaskFormer (R101 + 6 Enc) | 35.7 | No | Per-Pixel Classification is Not All You Need for... | 2021-07-13 | Code |