Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Your ViT is Secretly an Image Segmentation Model

Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, Daan de Geus

Published: 2025-03-24 · Venue: CVPR 2025
Tasks: Segmentation, Semantic Segmentation, Image Segmentation

Abstract

Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.
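As a rough illustration of how an encoder-only model can still emit masks, the sketch below assumes a mask-transformer-style head: a few query embeddings and the per-patch features both come out of the same ViT, and each query scores each patch with a dot product. The abstract does not describe this mechanism, so treat it as an assumption. The function names (`mask_logits`, `assign_patches`) and the toy numbers are hypothetical; a real model would use learned, high-dimensional embeddings.

```python
# Minimal sketch (assumption, not the authors' implementation):
# queries and patch features are plain lists of floats standing in
# for embeddings produced by a single ViT encoder.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mask_logits(query_embeddings, patch_features):
    """Return one logit per (query, patch) pair."""
    return [[dot(q, p) for p in patch_features] for q in query_embeddings]

def assign_patches(logits):
    """For each patch, pick the index of the query with the highest logit."""
    num_queries = len(logits)
    num_patches = len(logits[0])
    return [
        max(range(num_queries), key=lambda q: logits[q][i])
        for i in range(num_patches)
    ]

# Toy example: 2 queries, 4 patches, embedding dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
patches = [[2.0, 0.5], [0.1, 3.0], [1.0, 1.5], [4.0, 0.0]]

logits = mask_logits(queries, patches)
segments = assign_patches(logits)
print(segments)  # [0, 1, 1, 0]
```

Because the scoring is just a dot product on encoder outputs, no separate pixel decoder or Transformer decoder appears in this sketch, which is the kind of simplification the abstract argues for.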

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | Cityscapes val | FPS | 25 | EoMT (DINOv2-L, single-scale, 1024x1024) |
| Semantic Segmentation | Cityscapes val | Validation mIoU | 84.2 | EoMT (DINOv2-L, single-scale, 1024x1024) |
| Semantic Segmentation | Cityscapes val | mIoU | 84.2 | EoMT (DINOv2-L, single-scale, 1024x1024) |
| Semantic Segmentation | ADE20K val | mIoU | 58.4 | EoMT (DINOv2-L, single-scale, 512x512) |
| Semantic Segmentation | ADE20K | GFLOPs | 721 | EoMT (DINOv2-L, single-scale, 512x512) |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 721 | EoMT (DINOv2-L, single-scale, 512x512) |
| Semantic Segmentation | ADE20K | Mean IoU (class) | 58.4 | EoMT (DINOv2-L, single-scale, 512x512) |
| Semantic Segmentation | ADE20K | Params (M) | 316 | EoMT (DINOv2-L, single-scale, 512x512) |
| Semantic Segmentation | ADE20K | Validation mIoU | 58.4 | EoMT (DINOv2-L, single-scale, 512x512) |
| Panoptic Segmentation | ADE20K val | PQ | 52.8 | EoMT (DINOv2-g, single-scale, 1280x1280, COCO pre-trained) |
| Panoptic Segmentation | COCO minival | PQ | 59.2 | EoMT (DINOv2-g, single-scale, 1280x1280) |
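For readers unfamiliar with the two accuracy metrics in the table, the following sketch shows how mIoU and PQ are commonly defined. It is a simplified illustration, not the evaluation code behind these numbers: benchmark tooling accumulates statistics over a whole dataset, handles ignore labels, and (for PQ) first matches predicted and ground-truth segments at IoU > 0.5. The function names and toy inputs are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that occur in pred or target.

    pred and target are flat sequences of integer class labels, one per pixel.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            # A mismatched pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)

def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = sum of IoUs of matched segments / (TP + 0.5 * FP + 0.5 * FN).

    matched_ious holds the IoU of each true-positive segment pair;
    the matching step itself is assumed to have been done already.
    """
    num_tp = len(matched_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0

# Toy example: 6 pixels, 3 classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)  # per-class IoUs: 1/3, 2/3, 1/2

# Toy example: two matched segments, one false positive, one false negative.
pq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)  # 1.4 / 3
```

The table reports both metrics on a 0 to 100 scale, so a value such as 84.2 mIoU corresponds to 0.842 in the convention used by this sketch.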

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)