Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CorrCLIP: Reconstructing Correlations in CLIP with Off-the-Shelf Foundation Models for Open-Vocabulary Semantic Segmentation

Dengke Zhang, Fagui Liu, Quan Tang

2024-11-15

Tasks: Unsupervised Semantic Segmentation with Language-image Pre-training · Open-Vocabulary Semantic Segmentation · Segmentation · Semantic Segmentation · Zero-Shot Learning

Paper · PDF · Code (official)

Abstract

Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without relying on a predefined set of categories. Contrastive Language-Image Pre-training (CLIP) demonstrates outstanding zero-shot classification capabilities but struggles with the pixel-wise segmentation task because the inter-patch correlations it captures correspond to no specific visual concepts. Although previous CLIP-based works improve inter-patch correlations via self-self attention, they still face the inherent limitation that image patches tend to have high similarity to outlier patches. In this work, we introduce CorrCLIP, a training-free approach for open-vocabulary semantic segmentation that reconstructs significantly more coherent inter-patch correlations using foundation models. Specifically, it employs the Segment Anything Model (SAM) to define the scope of patch interactions, ensuring that patches interact only with semantically similar ones. Furthermore, CorrCLIP obtains an understanding of an image's semantic layout via self-supervised models to determine concrete similarity values between image patches, which addresses the similarity irregularity caused by the restricted patch-interaction regime above. Finally, CorrCLIP reuses the region masks produced by SAM to update the segmentation map. As a training-free method, CorrCLIP achieves a notable improvement across eight challenging benchmarks, boosting the averaged mean Intersection over Union (mIoU) from 44.4% to 51.0%.
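The two core ideas in the abstract — restricting patch interactions to SAM-defined regions, and filling in concrete similarity values from self-supervised features — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `seg_ids` (a hypothetical per-patch segment assignment derived from SAM masks) and `feats` (hypothetical patch features from a self-supervised model such as DINO) are assumed inputs, and the non-negative clamp and row normalization are illustrative choices.

```python
import numpy as np

def reconstruct_correlations(seg_ids, feats):
    """Sketch of the CorrCLIP-style correlation reconstruction described above.

    seg_ids: (N,) int array  -- segment index per patch (hypothetical, e.g. from SAM masks)
    feats:   (N, D) array    -- per-patch features (hypothetical, e.g. from DINO)
    Returns an (N, N) row-normalized correlation matrix in which each patch
    interacts only with patches in the same segment.
    """
    # Scope of interaction: 1 where two patches share a SAM segment, else 0.
    scope = (seg_ids[:, None] == seg_ids[None, :]).astype(float)
    # Concrete similarity values: cosine similarity of self-supervised features.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    # Combine: mask out cross-segment pairs; clamp negatives (an assumption here).
    corr = scope * np.maximum(sim, 0.0)
    # Row-normalize so each patch's correlations sum to 1.
    return corr / corr.sum(axis=1, keepdims=True)
```

Because every patch shares a segment with itself, each row has at least one nonzero entry, so the normalization is safe; cross-segment entries are exactly zero, which is the "restricted patch interaction regime" the abstract refers to.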

Results

Task                                 Dataset             Metric          Value   Model
Semantic Segmentation                COCO-Stuff-171      mIoU            34      CorrCLIP
Semantic Segmentation                COCO-Object         mIoU            49.4    CorrCLIP
Semantic Segmentation                ADE20K              Mean IoU (val)  30.7    CorrCLIP
Semantic Segmentation                Cityscapes val      mIoU            51.1    CorrCLIP
Semantic Segmentation                PASCAL Context-59   mIoU            50.8    CorrCLIP
Semantic Segmentation                PASCAL Context-60   mIoU            44.9    CorrCLIP
Semantic Segmentation                PascalVOC-20        mIoU            91.8    CorrCLIP
Semantic Segmentation                PASCAL VOC          mIoU            76.7    CorrCLIP
Unsupervised Semantic Segmentation   COCO-Stuff-171      mIoU            34      CorrCLIP
Unsupervised Semantic Segmentation   COCO-Object         mIoU            49.4    CorrCLIP
Unsupervised Semantic Segmentation   ADE20K              Mean IoU (val)  30.7    CorrCLIP
Unsupervised Semantic Segmentation   Cityscapes val      mIoU            51.1    CorrCLIP
Unsupervised Semantic Segmentation   PASCAL Context-59   mIoU            50.8    CorrCLIP
Unsupervised Semantic Segmentation   PASCAL Context-60   mIoU            44.9    CorrCLIP
Unsupervised Semantic Segmentation   PascalVOC-20        mIoU            91.8    CorrCLIP
Unsupervised Semantic Segmentation   PASCAL VOC          mIoU            76.7    CorrCLIP

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)