Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, Daan de Geus
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.
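The core idea, as described above, is that a plain ViT encoder can perform mask-based segmentation without an adapter, pixel decoder, or separate Transformer decoder. A minimal numpy sketch of this encoder-only scheme (a hypothetical simplification for illustration, not the authors' exact implementation): learnable query tokens are appended to the patch tokens, both are processed jointly by the final encoder blocks (stood in for here by a single shared projection), and mask logits are read out as dot products between query and patch embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: 64x64 patch grid, 100 queries, 150 classes.
num_patches, num_queries, dim, num_classes = 64 * 64, 100, 32, 150

patch_tokens = rng.standard_normal((num_patches, dim))  # ViT patch features
query_tokens = rng.standard_normal((num_queries, dim))  # learnable queries

# Process queries and patches jointly, as the final ViT blocks would;
# a single shared linear projection stands in for those blocks here.
w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
tokens = np.concatenate([query_tokens, patch_tokens], axis=0) @ w
queries, patches = tokens[:num_queries], tokens[num_queries:]

# Each query predicts one mask (logit per patch) and one class distribution.
mask_logits = queries @ patches.T                              # (100, 4096)
class_logits = queries @ rng.standard_normal((dim, num_classes))  # (100, 150)

print(mask_logits.shape)
print(class_logits.shape)
```

Because queries and patches share the same encoder blocks, no task-specific fusion or cross-attention modules are needed, which is where the reported speedup over adapter-plus-decoder pipelines comes from.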
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | Cityscapes val | mIoU | 84.2 | EoMT (DINOv2-L, single-scale, 1024x1024) |
| Semantic Segmentation | Cityscapes val | FPS | 25 | EoMT (DINOv2-L, single-scale, 1024x1024) |
| Semantic Segmentation | ADE20K val | mIoU | 58.4 | EoMT (DINOv2-L, single-scale, 512x512) |
| Semantic Segmentation | ADE20K val | GFLOPs | 721 | EoMT (DINOv2-L, single-scale, 512x512) |
| Semantic Segmentation | ADE20K val | Params (M) | 316 | EoMT (DINOv2-L, single-scale, 512x512) |
| Panoptic Segmentation | ADE20K val | PQ | 52.8 | EoMT (DINOv2-g, single-scale, 1280x1280, COCO pre-trained) |
| Panoptic Segmentation | COCO minival | PQ | 59.2 | EoMT (DINOv2-g, single-scale, 1280x1280) |