The Missing Point in Vision Transformers for Universal Image Segmentation

Sajjad Shahabodini, Mobina Mansoori, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi

2025-05-26
Panoptic Segmentation, Segmentation, Semantic Segmentation, Instance Segmentation, Image Segmentation

Abstract

Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad-sh33/ViT-P.
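The two-stage flow described in the abstract (class-agnostic mask proposals, then classification at each mask's central point) can be illustrated with a minimal sketch. The abstract does not say how the central point is computed, so the rule below is an assumption: take the mask centroid and snap it to the nearest foreground pixel, so the point always lies inside the mask. The names `mask_central_point`, `classify_proposals`, and `classify_point` are hypothetical, and `classify_point` is only a stand-in for the ViT-based point classifier, not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

Mask = Sequence[Sequence[int]]   # binary mask given as rows of 0/1 values
Point = Tuple[int, int]          # (row, col) pixel coordinate


def mask_central_point(mask: Mask) -> Point:
    """Return the foreground pixel closest to the mask centroid.

    Assumption for illustration only: snapping the centroid to a foreground
    pixel keeps the point inside the mask even for ring-shaped or concave
    proposals whose centroid falls on background.
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask has no foreground pixels")
    cy = sum(r for r, _ in pixels) / len(pixels)
    cx = sum(c for _, c in pixels) / len(pixels)
    # min() keeps the first pixel in row-major order when distances tie.
    return min(pixels, key=lambda p: (p[0] - cy) ** 2 + (p[1] - cx) ** 2)


def classify_proposals(
    masks: Sequence[Mask],
    classify_point: Callable[[Point], str],
) -> List[str]:
    """Stage two: assign one label per class-agnostic mask proposal.

    `classify_point` is a hypothetical placeholder for a point-based
    classifier that maps a pixel location to a class label.
    """
    return [classify_point(mask_central_point(m)) for m in masks]
```

In this sketch the proposal generator and the classifier are fully decoupled: any source of binary masks can feed `classify_proposals`, which mirrors the adapter-style design the abstract describes.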

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	COCO (Common Objects in Context)	mIoU	69.1	ViT-P (OneFormer, InternImage-H)
Semantic Segmentation	COCO (Common Objects in Context)	mIoU	68.8	ViT-P (OneFormer, DiNAT-L)
Semantic Segmentation	Cityscapes val	mIoU	87.4	ViT-P (InternImage-H)
Semantic Segmentation	COCO-Stuff test	mIoU	53.5	ViT-P (InternImage-H)
Semantic Segmentation	ADE20K	Params (M)	1610	ViT-P (InternImage-H)
Semantic Segmentation	ADE20K	Validation mIoU	63.6	ViT-P (InternImage-H)
Semantic Segmentation	ADE20K	Params (M)	1400	ViT-P (OneFormer, InternImage-H)
Semantic Segmentation	ADE20K	Validation mIoU	61.6	ViT-P (OneFormer, InternImage-H)
Semantic Segmentation	ADE20K	Params (M)	309	ViT-P (OneFormer, DiNAT-L)
Semantic Segmentation	ADE20K	Validation mIoU	59.9	ViT-P (OneFormer, DiNAT-L)
Semantic Segmentation	Cityscapes val	AP	50.6	ViT-P (OneFormer, InternImage-H)
Semantic Segmentation	Cityscapes val	PQ	70.8	ViT-P (OneFormer, InternImage-H)
Semantic Segmentation	Cityscapes val	mIoU	85.4	ViT-P (OneFormer, InternImage-H)
Semantic Segmentation	ADE20K val	PQ	54	ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280, COCO_pretrain)
Semantic Segmentation	ADE20K val	PQ	51.9	ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280)
Instance Segmentation	Cityscapes val	AP	49	ViT-P (OneFormer, ConvNeXt-L, single-scale, 512x1024, Mapillary Vistas-pretrained)
Instance Segmentation	Cityscapes val	mask AP	49	ViT-P (OneFormer, ConvNeXt-L, single-scale, 512x1024, Mapillary Vistas-pretrained)
Instance Segmentation	ADE20K val	AP	40.7	ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280, COCO_pretrain)
Instance Segmentation	ADE20K val	AP	37.8	ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280)
Panoptic Segmentation	Cityscapes val	AP	50.6	ViT-P (OneFormer, InternImage-H)
Panoptic Segmentation	Cityscapes val	PQ	70.8	ViT-P (OneFormer, InternImage-H)
Panoptic Segmentation	Cityscapes val	mIoU	85.4	ViT-P (OneFormer, InternImage-H)
Panoptic Segmentation	ADE20K val	PQ	54	ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280, COCO_pretrain)
Panoptic Segmentation	ADE20K val	PQ	51.9	ViT-P (OneFormer, DiNAT-L, single-scale, 1280x1280)
