Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang
Open-vocabulary semantic segmentation requires models to effectively integrate visual representations with open-vocabulary semantic labels. While Contrastive Language-Image Pre-training (CLIP) models shine in recognizing visual concepts from text, they often struggle with segment coherence due to their limited localization ability. In contrast, Vision Foundation Models (VFMs) excel at acquiring spatially consistent local visual representations, yet they fall short in semantic understanding. This paper introduces ProxyCLIP, a framework designed to harmonize the strengths of both CLIP and VFMs, facilitating enhanced open-vocabulary semantic segmentation. ProxyCLIP leverages the spatial feature correspondence from VFMs as a form of proxy attention to augment CLIP, thereby inheriting the VFMs' robust local consistency while maintaining CLIP's exceptional zero-shot transfer capacity. We propose an adaptive normalization and masking strategy to obtain the proxy attention from VFMs, allowing for adaptation across different VFMs. Remarkably, as a training-free approach, ProxyCLIP significantly improves the average mean Intersection over Union (mIoU) across eight benchmarks from 40.3 to 44.4, showcasing its exceptional efficacy in bridging the gap between spatial precision and semantic richness for the open-vocabulary segmentation task.
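The core idea described above can be sketched as follows: pairwise similarities between VFM patch features serve as an attention map that aggregates CLIP's value features. This is a minimal NumPy sketch, not the authors' implementation; the function name, the `beta`/`gamma` parameters, and the exact normalization and masking details are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def proxy_attention(vfm_feats, clip_values, beta=0.2, gamma=3.0):
    """Sketch of proxy attention in the spirit of ProxyCLIP.

    vfm_feats:   (N, Dv) patch features from a VFM (e.g. DINO).
    clip_values: (N, Dc) value features from CLIP's last block.
    beta, gamma: hypothetical masking threshold and temperature;
                 the paper's adaptive strategy may differ.
    """
    # Cosine similarity between VFM patch features acts as the
    # attention affinity ("proxy attention").
    f = vfm_feats / np.linalg.norm(vfm_feats, axis=-1, keepdims=True)
    sim = f @ f.T                                          # (N, N)

    # Normalization sketch: standardize similarities so the
    # threshold below is comparable across VFMs.
    norm = (sim - sim.mean()) / (sim.std() + 1e-6)

    # Masking sketch: suppress weakly correlated patch pairs.
    masked = np.where(norm < beta, -np.inf, norm)
    # Keep self-attention finite so every row has support.
    np.fill_diagonal(masked, norm.diagonal())

    # Aggregate CLIP value features with the proxy attention.
    attn = softmax(gamma * masked)                         # (N, N)
    return attn @ clip_values                              # (N, Dc)
```

Because the attention comes entirely from frozen VFM features and is applied to frozen CLIP features, no training is involved, consistent with the training-free claim.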
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 26.8 | ProxyCLIP |
| Semantic Segmentation | COCO-Object | mIoU | 39.2 | ProxyCLIP |
| Semantic Segmentation | ADE20K (val) | mIoU | 24.2 | ProxyCLIP |
| Semantic Segmentation | Cityscapes (val) | mIoU | 42.0 | ProxyCLIP |
| Semantic Segmentation | PASCAL Context-59 | mIoU | 39.6 | ProxyCLIP |
| Semantic Segmentation | PASCAL Context-60 | mIoU | 35.4 | ProxyCLIP |
| Semantic Segmentation | PASCAL VOC-20 | mIoU | 83.3 | ProxyCLIP |
| Semantic Segmentation | PASCAL VOC | mIoU | 65.0 | ProxyCLIP |