Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, Daan de Geus
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.
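The core idea, as described above, is that a plain ViT encoder can perform mask-based segmentation without an adapter, pixel decoder, or separate Transformer decoder. A minimal numpy sketch of this encoder-only scheme (a hypothetical simplification for illustration, not the authors' exact implementation): learnable query tokens are appended to the patch tokens, both are processed jointly by the final encoder blocks (stood in for here by a single shared projection), and mask logits are read out as dot products between query and patch embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: 64x64 patch grid, 100 queries, 150 classes.
num_patches, num_queries, dim, num_classes = 64 * 64, 100, 32, 150

patch_tokens = rng.standard_normal((num_patches, dim))  # ViT patch features
query_tokens = rng.standard_normal((num_queries, dim))  # learnable queries

# Process queries and patches jointly, as the final ViT blocks would;
# a single shared linear projection stands in for those blocks here.
w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
tokens = np.concatenate([query_tokens, patch_tokens], axis=0) @ w
queries, patches = tokens[:num_queries], tokens[num_queries:]

# Each query predicts one mask (logit per patch) and one class distribution.
mask_logits = queries @ patches.T                              # (100, 4096)
class_logits = queries @ rng.standard_normal((dim, num_classes))  # (100, 150)

print(mask_logits.shape)
print(class_logits.shape)
```

Because queries and patches share the same encoder blocks, no task-specific fusion or cross-attention modules are needed, which is where the reported speedup over adapter-plus-decoder pipelines comes from.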
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | Cityscapes val | mIoU | 84.2 | EoMT (DINOv2-L, single-scale, 1024x1024) |
| Semantic Segmentation | Cityscapes val | FPS | 25 | EoMT (DINOv2-L, single-scale, 1024x1024) |
| Semantic Segmentation | ADE20K val | mIoU | 58.4 | EoMT (DINOv2-L, single-scale, 512x512) |
| Semantic Segmentation | ADE20K val | GFLOPs | 721 | EoMT (DINOv2-L, single-scale, 512x512) |
| Semantic Segmentation | ADE20K val | Params (M) | 316 | EoMT (DINOv2-L, single-scale, 512x512) |
| Panoptic Segmentation | ADE20K val | PQ | 52.8 | EoMT (DINOv2-g, single-scale, 1280x1280, COCO pre-trained) |
| Panoptic Segmentation | COCO minival | PQ | 59.2 | EoMT (DINOv2-g, single-scale, 1280x1280) |