Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Siyu Jiao, Yunchao Wei, YaoWei Wang, Yao Zhao, Humphrey Shi

Published: 2023-09-30 · NeurIPS 2023
Tasks: Open Vocabulary Semantic Segmentation, Zero-Shot Segmentation
Links: Paper · PDF · Code (official)

Abstract

Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain CLIP's zero-shot transferability, previous practices favour freezing CLIP during training. However, in this paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This insensitivity results in numerous false positives when classifying mask proposals. This issue mainly relates to the fact that CLIP is trained with image-level supervision. To alleviate this issue, we propose a simple yet effective method, named Mask-aware Fine-tuning (MAFT). Specifically, an Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary numbers of images and mask proposals simultaneously. Then, a mask-aware loss and a self-distillation loss are designed to fine-tune the IP-CLIP Encoder, ensuring CLIP is responsive to different mask proposals while not sacrificing transferability. In this way, mask-aware representations can be easily learned to make the true positives stand out. Notably, our solution can seamlessly plug into most existing methods without introducing any new parameters during the fine-tuning process. We conduct extensive experiments on the popular zero-shot benchmarks. With MAFT, the performance of the state-of-the-art methods is promoted by a large margin: 50.4% (+8.2%) on COCO, 81.8% (+3.2%) on Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. The code is available at https://github.com/jiaosiyu1999/MAFT.git.
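The fine-tuning objective described above combines two terms: a mask-aware term that pushes the per-proposal class scores to reflect how well each mask proposal actually overlaps the ground truth, and a self-distillation term that keeps the fine-tuned encoder's outputs close to the frozen CLIP's to preserve transferability. A minimal NumPy sketch of these two losses, assuming a Smooth-L1 alignment between softmax scores and per-class mask IoU and an L2 distillation penalty (the function names and exact loss forms here are illustrative, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over class logits.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def smooth_l1(pred, target, beta=1.0):
    # Smooth-L1 (Huber) distance, averaged over all entries.
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

def maft_losses(ft_logits, frozen_logits, iou_targets):
    """Illustrative MAFT-style losses.

    ft_logits:     (P, C) class logits from the fine-tuned IP-CLIP encoder
    frozen_logits: (P, C) class logits from the frozen CLIP (teacher)
    iou_targets:   (P, C) IoU of each of P mask proposals with each class's
                   ground-truth mask (the "mask-aware" supervision signal)
    """
    scores = softmax(ft_logits)
    l_mask_aware = smooth_l1(scores, iou_targets)            # scores track mask quality
    l_distill = np.mean((ft_logits - frozen_logits) ** 2)    # stay close to frozen CLIP
    return l_mask_aware, l_distill
```

A good proposal (high IoU with a class's ground-truth mask) is thus pushed toward a high score for that class, while poorly overlapping proposals are suppressed, which is how the true positives are made to stand out.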

Results

Task                                   Dataset             Metric  Value  Model
Open Vocabulary Semantic Segmentation  ADE20K-847          mIoU    12.1   MAFT-ViTL
Open Vocabulary Semantic Segmentation  PASCAL Context-459  mIoU    15.7   MAFT-ViTL
Open Vocabulary Semantic Segmentation  PascalVOC-20        mIoU    92.1   MAFT-ViTL
Open Vocabulary Semantic Segmentation  PASCAL Context-59   mIoU    58.5   MAFT-ViTL
Open Vocabulary Semantic Segmentation  ADE20K-150          mIoU    32.0   MAFT-ViTL

Related Papers

Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation (2025-07-15)
Compress Any Segment Anything Model (SAM) (2025-07-11)
Foundation Models for Zero-Shot Segmentation of Scientific Images without AI-Ready Data (2025-06-30)
ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation (2025-06-26)
MRI-CORE: A Foundation Model for Magnetic Resonance Imaging (2025-06-13)
Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation (2025-06-11)
Textile Analysis for Recycling Automation using Transfer Learning and Zero-Shot Foundation Models (2025-06-06)
Zero-Shot Tree Detection and Segmentation from Aerial Forest Imagery (2025-06-03)