
TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

Yuqi Lin, Minghao Chen, Kaipeng Zhang, Hengjia Li, Mingming Li, Zheng Yang, Dongqin Lv, Binbin Lin, Haifeng Liu, Deng Cai

Published: 2023-12-20
Tasks: Weakly-Supervised Semantic Segmentation, Unsupervised Semantic Segmentation with Language-image Pre-training, Semantic Segmentation, Classification, Multi-Label Classification
Links: Paper | PDF | Code (official)

Abstract

Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained under a contrastive loss to capture global features that distinguish different text descriptions, making it highly effective for single-label classification. However, it performs poorly on multi-label datasets because the global feature tends to be dominated by the most prominent class, and the contrastive nature of the softmax operation aggravates this. In this study, we observe that multi-label classification results rely heavily on discriminative local features, which CLIP overlooks. We therefore dissect how patch-wise spatial information is preserved in CLIP and propose a local-to-global framework to obtain image tags. It comprises three steps: (1) patch-level classification to obtain coarse scores; (2) a dual-masking attention refinement (DMAR) module to refine the coarse scores; and (3) a class-wise reidentification (CWR) module to remedy predictions from a global perspective. The framework is built solely on frozen CLIP and significantly enhances its multi-label classification performance on various benchmarks without dataset-specific training. In addition, to comprehensively assess the quality and practicality of the generated tags, we extend their application to a downstream task, weakly supervised semantic segmentation (WSSS), using the generated tags as image-level pseudo-labels. Experiments demonstrate that this classify-then-segment paradigm dramatically outperforms other annotation-free segmentation methods and validates the effectiveness of the generated tags. Our code is available at https://github.com/linyq2117/TagCLIP.
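To make the three-step flow concrete, below is a minimal PyTorch sketch of a local-to-global tagging pipeline. It assumes patch tokens, a self-attention map, a global (class-token) embedding, and class text embeddings have already been extracted from a frozen CLIP model; the refinement and re-check rules here are simplified stand-ins for the paper's DMAR and CWR modules, and all names and thresholds are illustrative, not the official implementation.

```python
# Illustrative sketch only: tensor shapes and the refinement/re-check rules
# are assumptions for exposition, not the authors' code
# (see https://github.com/linyq2117/TagCLIP for the real implementation).
import torch
import torch.nn.functional as F

def tag_image(patch_feats: torch.Tensor,   # [N, D] L2-normalized CLIP patch tokens
              attn: torch.Tensor,          # [N, N] self-attention map (last block)
              global_feat: torch.Tensor,   # [D]    L2-normalized class-token embedding
              text_embs: torch.Tensor,     # [C, D] L2-normalized class text embeddings
              threshold: float = 0.5) -> torch.Tensor:
    # (1) Patch-level classification: cosine similarity of each patch to each class.
    coarse = patch_feats @ text_embs.t()                      # [N, C] coarse scores

    # (2) Refinement: propagate patch scores along the attention map, a crude
    #     stand-in for the paper's dual-masking attention refinement (DMAR).
    refined = attn @ coarse                                   # [N, C]

    # Pool over locations and min-max normalize per image.
    scores = refined.max(dim=0).values                        # [C]
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-6)

    # (3) Global re-check: keep a locally detected class only if the global
    #     embedding also supports it, a stand-in for class-wise
    #     reidentification (CWR).
    global_sim = F.softmax(100.0 * (global_feat @ text_embs.t()), dim=-1)  # [C]
    keep = (scores > threshold) & (global_sim > 1.0 / text_embs.shape[0])
    return keep.nonzero().flatten()        # indices of predicted image tags
```

The actual DMAR and CWR modules are more elaborate (dual masking over attention maps and region-level re-scoring); this sketch only conveys how coarse patch scores are refined locally and then vetted against the global view.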

Results

Task                               | Dataset       | Metric | Value | Model
-----------------------------------|---------------|--------|-------|--------
Semantic Segmentation              | COCO-Object   | mIoU   | 35.3  | CLS-SEG
Semantic Segmentation              | PASCAL VOC    | mIoU   | 68.7  | CLS-SEG
Semantic Segmentation              | COCO-Stuff-27 | mIoU   | 31    | CLS-SEG
Unsupervised Semantic Segmentation | COCO-Object   | mIoU   | 35.3  | CLS-SEG
Unsupervised Semantic Segmentation | PASCAL VOC    | mIoU   | 68.7  | CLS-SEG
Unsupervised Semantic Segmentation | COCO-Stuff-27 | mIoU   | 31    | CLS-SEG
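The mIoU values above are the mean intersection-over-union across classes. As a quick reference, here is a minimal per-image sketch; note that benchmark protocols typically accumulate intersections and unions over the whole dataset rather than averaging per image.

```python
# Minimal sketch of the mIoU metric reported above (per-image simplification).
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """pred, gt: integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```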

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)