


Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Siyu Jiao, Hongguang Zhu, Jiannan Huang, Yao Zhao, Yunchao Wei, Humphrey Shi

2024-08-01 · Open Vocabulary Semantic Segmentation · Semantic Segmentation · Open Vocabulary Panoptic Segmentation

Paper · PDF · Code (official)

Abstract

Pre-trained vision-language models, e.g., CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions either freeze CLIP during training to unilaterally maintain its zero-shot capability, or fine-tune the CLIP vision encoder to achieve perceptual sensitivity to local regions. However, few of them incorporate vision-text collaborative optimization. Motivated by this, we propose Content-Dependent Transfer, which adaptively enhances each text embedding by interacting with the input image, offering a parameter-efficient way to optimize the text representation. In addition, we introduce a Representation Compensation strategy that reviews the original CLIP-V representation as compensation, maintaining the zero-shot capability of CLIP. In this way, the vision and text representations of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish a collaborative vision-text optimization mechanism within the OVS field. Extensive experiments demonstrate that our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, it outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4, and +1.1 mIoU on A-847, A-150, PC-459, PC-59, and PAS-20, respectively. Furthermore, in the panoptic setting on ADE20K, it achieves 27.1 PQ, 73.5 SQ, and 32.9 RQ. Code will be available at https://github.com/jiaosiyu1999/MAFT-Plus.git.
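The two components described above lend themselves to a compact sketch. Below is a minimal, illustrative PyTorch version of the ideas as the abstract states them: a cross-attention adapter that makes each text embedding content-dependent on the input image, and a compensation step that blends fine-tuned vision features back toward the frozen CLIP-V originals. All module names, shapes, and the blending weight alpha are assumptions made for illustration, not the authors' MAFT+ implementation (see the linked repository for that).

import torch
import torch.nn as nn

class ContentDependentTransfer(nn.Module):
    """Adapt frozen CLIP text embeddings by attending to image features.

    A single cross-attention layer keeps the adapter parameter-efficient:
    class-name embeddings act as queries, image patch features as
    keys/values, and a residual connection keeps the original text
    embedding as the base. (Illustrative sketch; dimensions assumed.)
    """
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb, img_feats):
        # text_emb:  (B, num_classes, dim) frozen CLIP-T class embeddings
        # img_feats: (B, num_patches, dim) CLIP-V patch features of the image
        delta, _ = self.attn(query=text_emb, key=img_feats, value=img_feats)
        return self.norm(text_emb + delta)  # content-dependent text embeddings

def representation_compensation(tuned, frozen, alpha=0.5):
    """Blend fine-tuned CLIP-V features with the frozen originals, so the
    zero-shot alignment is not lost. A convex combination is one simple
    realization; alpha is an assumed hyper-parameter."""
    return alpha * tuned + (1.0 - alpha) * frozen

# Example: adapt 150 ADE20K class embeddings to a 14x14 patch grid.
cdt = ContentDependentTransfer(dim=512)
adapted = cdt(torch.randn(2, 150, 512), torch.randn(2, 196, 512))  # (2, 150, 512)

At this scale a single cross-attention layer adds only on the order of a million parameters on top of a frozen CLIP, which is what makes the text-side optimization parameter-efficient in this sketch.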

Results

Task | Dataset | Metric | Value | Model
Open Vocabulary Panoptic Segmentation | ADE20K | PQ | 27.1 | MAFT+
Open Vocabulary Semantic Segmentation | ADE20K-847 | mIoU | 15.1 | MAFT+
Open Vocabulary Semantic Segmentation | PASCAL Context-459 | mIoU | 21.6 | MAFT+
Open Vocabulary Semantic Segmentation | PascalVOC-20 | mIoU | 96.5 | MAFT+
Open Vocabulary Semantic Segmentation | PASCAL Context-59 | mIoU | 59.4 | MAFT+
Open Vocabulary Semantic Segmentation | ADE20K-150 | mIoU | 36.1 | MAFT+

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation (2025-07-15)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)