Extract Free Dense Labels from CLIP

Chong Zhou, Chen Change Loy, Bo Dai

2021-12-02

Zero-Shot Semantic Segmentation, Unsupervised Semantic Segmentation with Language-image Pre-training, Zero Shot Segmentation, Segmentation, Semantic Segmentation, Open Vocabulary Panoptic Segmentation


Abstract

Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. Many recent studies leverage the pre-trained CLIP models for image-level classification and manipulation. In this paper, we wish to examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. To this end, with minimal modification, we show that MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins, e.g., mIoUs of unseen classes on PASCAL VOC/PASCAL Context/COCO Stuff are improved from 35.6/20.7/30.3 to 86.1/66.7/54.7. We also test the robustness of MaskCLIP under input corruption and evaluate its capability in discriminating fine-grained objects and novel concepts. Our findings suggest that MaskCLIP can serve as a new reliable source of supervision for dense prediction tasks to achieve annotation-free segmentation. Source code is available at https://github.com/chongzhou96/MaskCLIP.
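
The general idea of reading dense labels off a language-image model can be illustrated with a small sketch. This is not the official MaskCLIP code. It assumes that per-pixel image features and per-class text embeddings have already been computed and live in the same embedding space; the function name `dense_labels` and both array arguments are hypothetical. Each pixel simply receives the class whose text embedding has the highest cosine similarity with that pixel's feature.

```python
import numpy as np

def dense_labels(pixel_feats, text_embeds):
    """Assign each pixel the class with the most similar text embedding.

    pixel_feats: (H, W, D) array of per-pixel image features (assumed given).
    text_embeds: (C, D) array of class-name text embeddings (assumed given).
    Returns an (H, W) integer array of class indices.
    """
    # L2-normalise both sides so the dot product equals cosine similarity.
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    # (H, W, D) @ (D, C) -> (H, W, C) similarity scores per pixel and class.
    scores = p @ t.T
    return scores.argmax(axis=-1)
```

Label maps produced this way need no annotations, which is what would make them usable as pseudo labels for a later self-training stage such as the one the abstract describes for MaskCLIP+.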

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	CC3M-TagMask	mIoU	41	MaskCLIP
Semantic Segmentation	COCO-Stuff-171	mIoU	16.4	MaskCLIP
Semantic Segmentation	COCO-Object	mIoU	20.6	MaskCLIP
Semantic Segmentation	ADE20K	Mean IoU (val)	9.8	MaskCLIP
Semantic Segmentation	Cityscapes val	mIoU	10	MaskCLIP
Semantic Segmentation	Cityscapes val	pixel accuracy	35.9	MaskCLIP
Semantic Segmentation	PASCAL Context-59	mIoU	26.4	MaskCLIP
Semantic Segmentation	PascalVOC-20	mIoU	74.9	MaskCLIP
Semantic Segmentation	PASCAL VOC	mIoU	29.3	MaskCLIP
Semantic Segmentation	KITTI-STEP	mIoU	15.3	DenseCLIP
Semantic Segmentation	KITTI-STEP	pixel accuracy	34.1	DenseCLIP
Semantic Segmentation	COCO-Stuff-27	mIoU	19.6	DenseCLIP
Semantic Segmentation	COCO-Stuff-27	pixel accuracy	32.2	DenseCLIP
Open Vocabulary Panoptic Segmentation	ADE20K	PQ	15.1	MaskCLIP
Zero Shot Segmentation	ADE20K training-free zero-shot segmentation	mIoU	10.2	MaskCLIP
Unsupervised Semantic Segmentation	COCO-Stuff-171	mIoU	16.4	MaskCLIP
Unsupervised Semantic Segmentation	COCO-Object	mIoU	20.6	MaskCLIP
Unsupervised Semantic Segmentation	ADE20K	Mean IoU (val)	9.8	MaskCLIP
Unsupervised Semantic Segmentation	Cityscapes val	mIoU	10	MaskCLIP
Unsupervised Semantic Segmentation	Cityscapes val	pixel accuracy	35.9	MaskCLIP
Unsupervised Semantic Segmentation	PASCAL Context-59	mIoU	26.4	MaskCLIP
Unsupervised Semantic Segmentation	PascalVOC-20	mIoU	74.9	MaskCLIP
Unsupervised Semantic Segmentation	PASCAL VOC	mIoU	29.3	MaskCLIP
Unsupervised Semantic Segmentation	KITTI-STEP	mIoU	15.3	DenseCLIP
Unsupervised Semantic Segmentation	KITTI-STEP	pixel accuracy	34.1	DenseCLIP
Unsupervised Semantic Segmentation	COCO-Stuff-27	mIoU	19.6	DenseCLIP
Unsupervised Semantic Segmentation	COCO-Stuff-27	pixel accuracy	32.2	DenseCLIP
Open Vocabulary Semantic Segmentation	PASCAL Context-459	mIoU	10	MaskCLIP
Zero-Shot Semantic Segmentation	PASCAL VOC	Transductive Setting hIoU	87.4	MaskCLIP+
Zero-Shot Semantic Segmentation	COCO-Stuff	Transductive Setting hIoU	45	MaskCLIP+
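
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how mIoU and hIoU are commonly defined, not the evaluation code used to produce the numbers above. It assumes the usual conventions: mIoU averages per-class intersection-over-union, and hIoU is the harmonic mean of the mIoU over seen classes and the mIoU over unseen classes. The function names are hypothetical, and real benchmarks typically accumulate intersections and unions over the whole dataset rather than a single label map.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def harmonic_iou(miou_seen, miou_unseen):
    """Harmonic mean of seen-class and unseen-class mIoU."""
    total = miou_seen + miou_unseen
    if total == 0:
        return 0.0
    return 2 * miou_seen * miou_unseen / total
```

Because the harmonic mean is dominated by the smaller of its two inputs, a high hIoU requires good performance on seen and unseen classes at the same time.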
