Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Extract Free Dense Labels from CLIP

Chong Zhou, Chen Change Loy, Bo Dai

2021-12-02 | Zero-Shot Semantic Segmentation | Unsupervised Semantic Segmentation with Language-image Pre-training | Zero Shot Segmentation | Segmentation | Semantic Segmentation | Open Vocabulary Panoptic Segmentation

Abstract

Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. Many recent studies leverage the pre-trained CLIP models for image-level classification and manipulation. In this paper, we wish to examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. To this end, with minimal modification, we show that MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins, e.g., mIoUs of unseen classes on PASCAL VOC/PASCAL Context/COCO Stuff are improved from 35.6/20.7/30.3 to 86.1/66.7/54.7. We also test the robustness of MaskCLIP under input corruption and evaluate its capability in discriminating fine-grained objects and novel concepts. Our findings suggest that MaskCLIP can serve as a new reliable source of supervision for dense prediction tasks to achieve annotation-free segmentation. Source code is available at https://github.com/chongzhou96/MaskCLIP.
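The abstract does not spell out how dense labels are read off CLIP, so the following is only a minimal sketch of one common idea behind annotation-free dense prediction: give each pixel the class whose text embedding is most similar to that pixel's feature vector. It assumes per-pixel features and per-class text embeddings are already available as plain lists of floats. The function names `cosine` and `label_pixels`, and the toy vectors, are hypothetical and are not taken from the MaskCLIP code base.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_pixels(pixel_features, text_embeddings):
    """Assign each pixel the class name with the most similar text embedding.

    pixel_features:  H x W grid (list of rows) of feature vectors.
    text_embeddings: dict mapping class name -> embedding vector.
    Both inputs are assumed to live in the same embedding space.
    """
    names = list(text_embeddings)
    return [
        [max(names, key=lambda n: cosine(feat, text_embeddings[n])) for feat in row]
        for row in pixel_features
    ]


# Toy usage with made-up 2-D embeddings (illustrative values only).
toy_features = [[[1.0, 0.0], [0.0, 1.0]]]
toy_classes = {"cat": [0.9, 0.1], "grass": [0.1, 0.9]}
toy_labels = label_pixels(toy_features, toy_classes)
```

Label maps produced this way could in principle serve as pseudo labels for a later self-training stage, which is the role the abstract attributes to MaskCLIP+; the details of that stage are not covered by this sketch.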

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | CC3M-TagMask | mIoU | 41 | MaskCLIP
Semantic Segmentation | COCO-Stuff-171 | mIoU | 16.4 | MaskCLIP
Semantic Segmentation | COCO-Object | mIoU | 20.6 | MaskCLIP
Semantic Segmentation | ADE20K | Mean IoU (val) | 9.8 | MaskCLIP
Semantic Segmentation | Cityscapes val | mIoU | 10 | MaskCLIP
Semantic Segmentation | Cityscapes val | pixel accuracy | 35.9 | MaskCLIP
Semantic Segmentation | PASCAL Context-59 | mIoU | 26.4 | MaskCLIP
Semantic Segmentation | PascalVOC-20 | mIoU | 74.9 | MaskCLIP
Semantic Segmentation | PASCAL VOC | mIoU | 29.3 | MaskCLIP
Semantic Segmentation | KITTI-STEP | mIoU | 15.3 | DenseCLIP
Semantic Segmentation | KITTI-STEP | pixel accuracy | 34.1 | DenseCLIP
Semantic Segmentation | COCO-Stuff-27 | mIoU | 19.6 | DenseCLIP
Semantic Segmentation | COCO-Stuff-27 | pixel accuracy | 32.2 | DenseCLIP
Open Vocabulary Panoptic Segmentation | ADE20K | PQ | 15.1 | MaskCLIP
Zero Shot Segmentation | ADE20K training-free zero-shot segmentation | mIoU | 10.2 | MaskCLIP
Unsupervised Semantic Segmentation | COCO-Stuff-171 | mIoU | 16.4 | MaskCLIP
Unsupervised Semantic Segmentation | COCO-Object | mIoU | 20.6 | MaskCLIP
Unsupervised Semantic Segmentation | ADE20K | Mean IoU (val) | 9.8 | MaskCLIP
Unsupervised Semantic Segmentation | Cityscapes val | mIoU | 10 | MaskCLIP
Unsupervised Semantic Segmentation | Cityscapes val | pixel accuracy | 35.9 | MaskCLIP
Unsupervised Semantic Segmentation | PASCAL Context-59 | mIoU | 26.4 | MaskCLIP
Unsupervised Semantic Segmentation | PascalVOC-20 | mIoU | 74.9 | MaskCLIP
Unsupervised Semantic Segmentation | PASCAL VOC | mIoU | 29.3 | MaskCLIP
Unsupervised Semantic Segmentation | KITTI-STEP | mIoU | 15.3 | DenseCLIP
Unsupervised Semantic Segmentation | KITTI-STEP | pixel accuracy | 34.1 | DenseCLIP
Unsupervised Semantic Segmentation | COCO-Stuff-27 | mIoU | 19.6 | DenseCLIP
Unsupervised Semantic Segmentation | COCO-Stuff-27 | pixel accuracy | 32.2 | DenseCLIP
Open Vocabulary Semantic Segmentation | PASCAL Context-459 | mIoU | 10 | MaskCLIP
10-shot image generation | CC3M-TagMask | mIoU | 41 | MaskCLIP
10-shot image generation | COCO-Stuff-171 | mIoU | 16.4 | MaskCLIP
10-shot image generation | COCO-Object | mIoU | 20.6 | MaskCLIP
10-shot image generation | ADE20K | Mean IoU (val) | 9.8 | MaskCLIP
10-shot image generation | Cityscapes val | mIoU | 10 | MaskCLIP
10-shot image generation | Cityscapes val | pixel accuracy | 35.9 | MaskCLIP
10-shot image generation | PASCAL Context-59 | mIoU | 26.4 | MaskCLIP
10-shot image generation | PascalVOC-20 | mIoU | 74.9 | MaskCLIP
10-shot image generation | PASCAL VOC | mIoU | 29.3 | MaskCLIP
10-shot image generation | KITTI-STEP | mIoU | 15.3 | DenseCLIP
10-shot image generation | KITTI-STEP | pixel accuracy | 34.1 | DenseCLIP
10-shot image generation | COCO-Stuff-27 | mIoU | 19.6 | DenseCLIP
10-shot image generation | COCO-Stuff-27 | pixel accuracy | 32.2 | DenseCLIP
Zero-Shot Semantic Segmentation | PASCAL VOC | Transductive Setting hIoU | 87.4 | MaskCLIP+
Zero-Shot Semantic Segmentation | COCO-Stuff | Transductive Setting hIoU | 45 | MaskCLIP+
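
For readers unfamiliar with the metrics in the results above, here is a minimal sketch of how pixel accuracy, mIoU, and hIoU are conventionally computed from flat lists of ground-truth and predicted class indices. It assumes the usual definitions: IoU per class is TP / (TP + FP + FN), mIoU averages it over classes, and hIoU is the harmonic mean of the seen-class and unseen-class mIoU. The exact evaluation protocol behind each listed number (ignored labels, class subsets, resolution) is not given here, so the helper names and the toy data are illustrative only.

```python
def confusion_matrix(truth, pred, num_classes):
    """cm[t][p] counts pixels with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of all pixels whose predicted class matches the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def per_class_iou(cm):
    """IoU per class; None for classes absent from both truth and prediction."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(row[c] for row in cm) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else None)
    return ious


def mean_iou(cm, classes=None):
    """Mean IoU over the given class indices (all classes by default)."""
    ious = per_class_iou(cm)
    chosen = classes if classes is not None else range(len(cm))
    picked = [ious[c] for c in chosen if ious[c] is not None]
    return sum(picked) / len(picked) if picked else 0.0


def harmonic_iou(seen_miou, unseen_miou):
    """Harmonic mean of the seen-class and unseen-class mIoU."""
    total = seen_miou + unseen_miou
    return 2 * seen_miou * unseen_miou / total if total else 0.0


# Toy example: six pixels, three classes (made-up labels).
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(truth, pred, num_classes=3)
acc = pixel_accuracy(cm)
miou = mean_iou(cm)
# Treat classes 0 and 1 as "seen" and class 2 as "unseen" purely for illustration.
hiou = harmonic_iou(mean_iou(cm, classes=[0, 1]), mean_iou(cm, classes=[2]))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot reach a high hIoU by doing well on seen classes alone.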

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)