Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CorrCLIP: Reconstructing Correlations in CLIP with Off-the-Shelf Foundation Models for Open-Vocabulary Semantic Segmentation

Dengke Zhang, Fagui Liu, Quan Tang

2024-11-15

Tasks: Unsupervised Semantic Segmentation with Language-image Pre-training · Open-Vocabulary Semantic Segmentation · Segmentation · Semantic Segmentation · Zero-Shot Learning

Paper · PDF · Code (official)

Abstract

Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without relying on a predefined set of categories. Contrastive Language-Image Pre-training (CLIP) demonstrates outstanding zero-shot classification capabilities but struggles with the pixel-wise segmentation task because the inter-patch correlations it captures correspond to no specific visual concepts. Although previous CLIP-based works improve inter-patch correlations via self-self attention, they still face the inherent limitation that image patches tend to have high similarity to outlier patches. In this work, we introduce CorrCLIP, a training-free approach for open-vocabulary semantic segmentation that reconstructs significantly more coherent inter-patch correlations using foundation models. Specifically, it employs the Segment Anything Model (SAM) to define the scope of patch interactions, ensuring that patches interact only with semantically similar ones. Furthermore, CorrCLIP obtains an understanding of an image's semantic layout via self-supervised models to determine concrete similarity values between image patches, which addresses the similarity irregularity caused by the restricted patch-interaction regime above. Finally, CorrCLIP reuses the region masks produced by SAM to update the segmentation map. As a training-free method, CorrCLIP achieves a notable improvement across eight challenging benchmarks, boosting the averaged mean Intersection over Union (mIoU) from 44.4% to 51.0%.
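The two core ideas in the abstract — restricting patch interactions to SAM-defined regions, and filling in concrete similarity values from self-supervised features — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `seg_ids` (a hypothetical per-patch segment assignment derived from SAM masks) and `feats` (hypothetical patch features from a self-supervised model such as DINO) are assumed inputs, and the non-negative clamp and row normalization are illustrative choices.

```python
import numpy as np

def reconstruct_correlations(seg_ids, feats):
    """Sketch of the CorrCLIP-style correlation reconstruction described above.

    seg_ids: (N,) int array  -- segment index per patch (hypothetical, e.g. from SAM masks)
    feats:   (N, D) array    -- per-patch features (hypothetical, e.g. from DINO)
    Returns an (N, N) row-normalized correlation matrix in which each patch
    interacts only with patches in the same segment.
    """
    # Scope of interaction: 1 where two patches share a SAM segment, else 0.
    scope = (seg_ids[:, None] == seg_ids[None, :]).astype(float)
    # Concrete similarity values: cosine similarity of self-supervised features.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    # Combine: mask out cross-segment pairs; clamp negatives (an assumption here).
    corr = scope * np.maximum(sim, 0.0)
    # Row-normalize so each patch's correlations sum to 1.
    return corr / corr.sum(axis=1, keepdims=True)
```

Because every patch shares a segment with itself, each row has at least one nonzero entry, so the normalization is safe; cross-segment entries are exactly zero, which is the "restricted patch interaction regime" the abstract refers to.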

Results

Task                                 Dataset             Metric          Value   Model
Semantic Segmentation                COCO-Stuff-171      mIoU            34      CorrCLIP
Semantic Segmentation                COCO-Object         mIoU            49.4    CorrCLIP
Semantic Segmentation                ADE20K              Mean IoU (val)  30.7    CorrCLIP
Semantic Segmentation                Cityscapes val      mIoU            51.1    CorrCLIP
Semantic Segmentation                PASCAL Context-59   mIoU            50.8    CorrCLIP
Semantic Segmentation                PASCAL Context-60   mIoU            44.9    CorrCLIP
Semantic Segmentation                PascalVOC-20        mIoU            91.8    CorrCLIP
Semantic Segmentation                PASCAL VOC          mIoU            76.7    CorrCLIP
Unsupervised Semantic Segmentation   COCO-Stuff-171      mIoU            34      CorrCLIP
Unsupervised Semantic Segmentation   COCO-Object         mIoU            49.4    CorrCLIP
Unsupervised Semantic Segmentation   ADE20K              Mean IoU (val)  30.7    CorrCLIP
Unsupervised Semantic Segmentation   Cityscapes val      mIoU            51.1    CorrCLIP
Unsupervised Semantic Segmentation   PASCAL Context-59   mIoU            50.8    CorrCLIP
Unsupervised Semantic Segmentation   PASCAL Context-60   mIoU            44.9    CorrCLIP
Unsupervised Semantic Segmentation   PascalVOC-20        mIoU            91.8    CorrCLIP
Unsupervised Semantic Segmentation   PASCAL VOC          mIoU            76.7    CorrCLIP

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)