Yuheng Shi, Minjing Dong, Chang Xu
While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary prediction, its performance on semantic segmentation remains suboptimal. This shortfall stems primarily from its spatially invariant semantic features and its constrained resolution. While previous adaptations addressed spatial invariance by modifying the self-attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Different from previous segment-then-splice methods, which segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. In addition, we propose a refinement strategy for CLIP's coarse segmentation outputs that transforms them into prompts for SAM, further enhancing segmentation performance. Trident achieves a significant improvement in mIoU across eight benchmarks compared with the current SOTA, increasing from 44.4 to 48.6. Code is available at https://github.com/YuHengsss/Trident.
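The global-aggregation step described above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes spliced per-patch CLIP/DINO features and SAM encoder features are already available as arrays, and shows only the core idea: a patch-to-patch correlation (affinity) matrix built from SAM features is used to aggregate the spliced features across the whole image, broadening the receptive field beyond each sliding-window sub-image.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_aggregation(spliced_feats, sam_feats):
    """Aggregate spliced features with a SAM-derived correlation matrix.

    spliced_feats: (N, C) CLIP/DINO features spliced over all sub-image patches
    sam_feats:     (N, D) SAM encoder features for the same N patches
    Returns:       (N, C) globally aggregated features
    """
    # Patch-to-patch correlation over the full image, scaled like attention.
    corr = softmax(sam_feats @ sam_feats.T / np.sqrt(sam_feats.shape[1]))
    # Each output patch is a correlation-weighted mix of all spliced features.
    return corr @ spliced_feats

# Toy example: 6 patches, 4-dim spliced features, 8-dim SAM features.
rng = np.random.default_rng(0)
out = global_aggregation(rng.normal(size=(6, 4)), rng.normal(size=(6, 8)))
assert out.shape == (6, 4)
```

In the actual framework the aggregated features would then be matched against CLIP text embeddings to produce the coarse segmentation that is later refined via SAM prompts; the array shapes and function names here are illustrative assumptions.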
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 28.6 | Trident |
| Semantic Segmentation | COCO-Object | mIoU | 42.2 | Trident |
| Semantic Segmentation | ADE20K (val) | mIoU | 26.7 | Trident |
| Semantic Segmentation | Cityscapes (val) | mIoU | 47.6 | Trident |
| Semantic Segmentation | PASCAL Context-59 | mIoU | 44.3 | Trident |
| Semantic Segmentation | PASCAL Context-60 | mIoU | 40.1 | Trident |
| Semantic Segmentation | PASCAL VOC-20 | mIoU | 88.7 | Trident |
| Semantic Segmentation | PASCAL VOC | mIoU | 70.8 | Trident |