Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

Yuheng Shi, Minjing Dong, Chang Xu

Date: 2024-11-14
Tasks: Unsupervised Semantic Segmentation with Language-image Pre-training, Segmentation, Semantic Segmentation

Abstract

While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatial-invariant semantic features and constrained resolution. While previous adaptations addressed spatial invariance by modifying the self-attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Different from previous segment-then-splice methods that segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment-Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. In addition, we propose a refinement strategy for CLIP's coarse segmentation outputs by transforming them into prompts for SAM, further enhancing segmentation performance. Trident achieves a significant improvement in mIoU across eight benchmarks compared with the current SOTA, increasing from 44.4 to 48.6. Code is available at https://github.com/YuHengsss/Trident.
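
The splice-then-segment idea described in the abstract can be illustrated with a toy sketch. This is not the Trident implementation: the function names, the cosine-plus-softmax form of the correlation matrix, the temperature value, and the hand-written feature vectors are all assumptions made for illustration. The `guide` vectors stand in for features from a high-resolution encoder such as SAM's, and `tokens` stand in for per-patch features from sub-images.

```python
import math

def splice_tiles(tiles):
    """Stitch a 2D grid of tile feature maps into one full feature map.

    tiles[r][c] is one tile: a list of rows, each row a list of feature vectors.
    """
    full = []
    for tile_row in tiles:
        for y in range(len(tile_row[0])):
            row = []
            for tile in tile_row:
                row.extend(tile[y])
            full.append(row)
    return full

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(xs, temperature):
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(features, guide, temperature=0.1):
    """Mix each token with all tokens, weighted by a guide-feature correlation matrix.

    Assumption: correlation = row-wise softmax over cosine similarity of guide features.
    """
    dim = len(features[0])
    out = []
    for g_i in guide:
        weights = softmax([cosine(g_i, g_j) for g_j in guide], temperature)
        out.append([sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)])
    return out

def segment(features, text_embeddings):
    """Label each token with the index of its most similar text embedding."""
    return [
        max(range(len(text_embeddings)), key=lambda k: cosine(f, text_embeddings[k]))
        for f in features
    ]

# Two 1x2 tiles side by side; the second token of the left tile is noisy.
left_tile = [[[1.0, 0.0], [0.4, 0.6]]]
right_tile = [[[0.0, 1.0], [0.0, 1.0]]]
full_map = splice_tiles([[left_tile, right_tile]])
tokens = [vec for row in full_map for vec in row]

# Hypothetical guide features computed on the whole image: tokens 0-1 belong together, as do 2-3.
guide = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # class 0, class 1

before = segment(tokens, text_embeddings)
after = segment(aggregate(tokens, guide), text_embeddings)
print(before)  # [0, 1, 1, 1]
print(after)   # [0, 0, 1, 1]
```

In this toy case the noisy token is mislabeled when classified on its own, and global aggregation over the spliced map pulls it back toward the tokens it correlates with. A segment-then-splice pipeline would instead label each tile separately, so a token could only draw on context from inside its own window.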

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | COCO-Stuff-171 | mIoU | 28.6 | Trident
Semantic Segmentation | COCO-Object | mIoU | 42.2 | Trident
Semantic Segmentation | ADE20K | Mean IoU (val) | 26.7 | Trident
Semantic Segmentation | Cityscapes val | mIoU | 47.6 | Trident
Semantic Segmentation | PASCAL Context-59 | mIoU | 44.3 | Trident
Semantic Segmentation | PASCAL Context-60 | mIoU | 40.1 | Trident
Semantic Segmentation | PascalVOC-20 | mIoU | 88.7 | Trident
Semantic Segmentation | PASCAL VOC | mIoU | 70.8 | Trident
Unsupervised Semantic Segmentation | COCO-Stuff-171 | mIoU | 28.6 | Trident
Unsupervised Semantic Segmentation | COCO-Object | mIoU | 42.2 | Trident
Unsupervised Semantic Segmentation | ADE20K | Mean IoU (val) | 26.7 | Trident
Unsupervised Semantic Segmentation | Cityscapes val | mIoU | 47.6 | Trident
Unsupervised Semantic Segmentation | PASCAL Context-59 | mIoU | 44.3 | Trident
Unsupervised Semantic Segmentation | PASCAL Context-60 | mIoU | 40.1 | Trident
Unsupervised Semantic Segmentation | PascalVOC-20 | mIoU | 88.7 | Trident
Unsupervised Semantic Segmentation | PASCAL VOC | mIoU | 70.8 | Trident
10-shot image generation | COCO-Stuff-171 | mIoU | 28.6 | Trident
10-shot image generation | COCO-Object | mIoU | 42.2 | Trident
10-shot image generation | ADE20K | Mean IoU (val) | 26.7 | Trident
10-shot image generation | Cityscapes val | mIoU | 47.6 | Trident
10-shot image generation | PASCAL Context-59 | mIoU | 44.3 | Trident
10-shot image generation | PASCAL Context-60 | mIoU | 40.1 | Trident
10-shot image generation | PascalVOC-20 | mIoU | 88.7 | Trident
10-shot image generation | PASCAL VOC | mIoU | 70.8 | Trident
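
As a quick consistency check, the eight per-dataset scores listed above can be averaged and compared with the 48.6 mIoU quoted in the abstract. The sketch below assumes that the abstract's "eight benchmarks" are exactly these eight datasets and that the headline number is an unweighted mean; neither is stated explicitly on this page.

```python
# Per-dataset mIoU values for Trident, copied from the results listed above.
trident_miou = {
    "COCO-Stuff-171": 28.6,
    "COCO-Object": 42.2,
    "ADE20K": 26.7,
    "Cityscapes val": 47.6,
    "PASCAL Context-59": 44.3,
    "PASCAL Context-60": 40.1,
    "PascalVOC-20": 88.7,
    "PASCAL VOC": 70.8,
}

# Assumption: the headline figure is a plain (unweighted) average over datasets.
mean_miou = sum(trident_miou.values()) / len(trident_miou)
print(f"{mean_miou:.1f}")  # 48.6
```

Under those assumptions the average agrees with the abstract's reported figure.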

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)