
ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang

2024-08-09

Tasks: Unsupervised Semantic Segmentation with Language-image Pre-training, Open Vocabulary Semantic Segmentation, Segmentation, Semantic Segmentation, Open-Vocabulary Semantic Segmentation

Abstract

Open-vocabulary semantic segmentation requires models to effectively integrate visual representations with open-vocabulary semantic labels. While Contrastive Language-Image Pre-training (CLIP) models shine in recognizing visual concepts from text, they often struggle with segment coherence due to their limited localization ability. In contrast, Vision Foundation Models (VFMs) excel at acquiring spatially consistent local visual representations, yet they fall short in semantic understanding. This paper introduces ProxyCLIP, an innovative framework designed to harmonize the strengths of both CLIP and VFMs, facilitating enhanced open-vocabulary semantic segmentation. ProxyCLIP leverages the spatial feature correspondence from VFMs as a form of proxy attention to augment CLIP, thereby inheriting the VFMs' robust local consistency and maintaining CLIP's exceptional zero-shot transfer capacity. We propose an adaptive normalization and masking strategy to get the proxy attention from VFMs, allowing for adaptation across different VFMs. Remarkably, as a training-free approach, ProxyCLIP significantly improves the average mean Intersection over Union (mIoU) across eight benchmarks from 40.3 to 44.4, showcasing its exceptional efficacy in bridging the gap between spatial precision and semantic richness for the open-vocabulary segmentation task.
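
The proxy-attention idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `beta` and `gamma` parameters, and the exact form of the normalization and masking are assumptions made for illustration. The sketch only assumes what the abstract states, namely that patch-to-patch similarity from a VFM replaces CLIP's own attention weights when aggregating CLIP features, which are then matched against text embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def proxy_attention(vfm_feats, beta=1.2, gamma=3.0):
    """Build an (N, N) attention matrix from VFM patch features of shape (N, D).

    beta and gamma are hypothetical hyperparameters for this sketch.
    """
    f = l2_normalize(vfm_feats)
    sim = f @ f.T  # cosine similarity between every pair of patches

    # Assumed normalization: shift each row by a scaled row mean, then sharpen.
    shifted = gamma * (sim - beta * sim.mean(axis=1, keepdims=True))

    # Assumed masking: ignore patches below the row threshold.
    # Each patch always keeps itself so no row is fully masked.
    keep = (shifted > 0) | np.eye(len(sim), dtype=bool)
    logits = np.where(keep, shifted, -np.inf)

    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def segment(vfm_feats, clip_values, text_embeds):
    """Assign each patch the index of the best-matching text embedding.

    clip_values: (N, C) CLIP patch features; text_embeds: (K, C).
    """
    attn = proxy_attention(vfm_feats)
    patch_embeds = l2_normalize(attn @ clip_values)  # aggregate CLIP features
    scores = patch_embeds @ l2_normalize(text_embeds).T
    return scores.argmax(axis=1)


# Toy input: four patches forming two groups according to the VFM features.
vfm = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
clip_vals = np.array([[1.0, 0.0], [0.6, 0.4], [0.1, 0.9], [0.3, 0.7]])
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
print(segment(vfm, clip_vals, texts))
```

In this toy input, patches that the VFM considers similar pool their CLIP features before classification, which is the local-consistency effect the abstract attributes to proxy attention.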

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 26.8 | ProxyCLIP |
| Semantic Segmentation | COCO-Object | mIoU | 39.2 | ProxyCLIP |
| Semantic Segmentation | ADE20K | Mean IoU (val) | 24.2 | ProxyCLIP |
| Semantic Segmentation | Cityscapes val | mIoU | 42 | ProxyCLIP |
| Semantic Segmentation | PASCAL Context-59 | mIoU | 39.6 | ProxyCLIP |
| Semantic Segmentation | PASCAL Context-60 | mIoU | 35.4 | ProxyCLIP |
| Semantic Segmentation | PascalVOC-20 | mIoU | 83.3 | ProxyCLIP |
| Semantic Segmentation | PASCAL VOC | mIoU | 65 | ProxyCLIP |
| Unsupervised Semantic Segmentation | COCO-Stuff-171 | mIoU | 26.8 | ProxyCLIP |
| Unsupervised Semantic Segmentation | COCO-Object | mIoU | 39.2 | ProxyCLIP |
| Unsupervised Semantic Segmentation | ADE20K | Mean IoU (val) | 24.2 | ProxyCLIP |
| Unsupervised Semantic Segmentation | Cityscapes val | mIoU | 42 | ProxyCLIP |
| Unsupervised Semantic Segmentation | PASCAL Context-59 | mIoU | 39.6 | ProxyCLIP |
| Unsupervised Semantic Segmentation | PASCAL Context-60 | mIoU | 35.4 | ProxyCLIP |
| Unsupervised Semantic Segmentation | PascalVOC-20 | mIoU | 83.3 | ProxyCLIP |
| Unsupervised Semantic Segmentation | PASCAL VOC | mIoU | 65 | ProxyCLIP |
| 10-shot image generation | COCO-Stuff-171 | mIoU | 26.8 | ProxyCLIP |
| 10-shot image generation | COCO-Object | mIoU | 39.2 | ProxyCLIP |
| 10-shot image generation | ADE20K | Mean IoU (val) | 24.2 | ProxyCLIP |
| 10-shot image generation | Cityscapes val | mIoU | 42 | ProxyCLIP |
| 10-shot image generation | PASCAL Context-59 | mIoU | 39.6 | ProxyCLIP |
| 10-shot image generation | PASCAL Context-60 | mIoU | 35.4 | ProxyCLIP |
| 10-shot image generation | PascalVOC-20 | mIoU | 83.3 | ProxyCLIP |
| 10-shot image generation | PASCAL VOC | mIoU | 65 | ProxyCLIP |
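
As a quick consistency check, the eight per-dataset values listed above can be averaged and compared with the 44.4 average mIoU quoted in the abstract. This assumes the eight datasets listed here are the same eight benchmarks the abstract refers to; the variable names are illustrative.

```python
# Per-dataset mIoU values for ProxyCLIP, copied from the results above.
scores = {
    "COCO-Stuff-171": 26.8,
    "COCO-Object": 39.2,
    "ADE20K": 24.2,
    "Cityscapes val": 42.0,
    "PASCAL Context-59": 39.6,
    "PASCAL Context-60": 35.4,
    "PascalVOC-20": 83.3,
    "PASCAL VOC": 65.0,
}

# Unweighted mean across the eight datasets.
mean_miou = sum(scores.values()) / len(scores)
print(f"{mean_miou:.1f}")  # prints 44.4
```

The unweighted mean is 355.5 / 8 = 44.4375, which rounds to the reported 44.4.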
