Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen

Published: 2023-08-04 · NeurIPS 2023
Tasks: Open-Vocabulary Semantic Segmentation · Semantic Segmentation · Open-Vocabulary Panoptic Segmentation
Links: Paper · PDF · Code (official)

Abstract

Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, which provide image and text features in a shared embedding space and thereby bridge the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process extracts features from the image multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline but also yields a remarkably better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to input resolutions larger than the one used during contrastive image-text pretraining. When training on COCO panoptic data only and testing in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K; 18.2 PQ and 27.9 mIoU on Mapillary Vistas; and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes, respectively. Additionally, FC-CLIP trains 7.5x faster and runs inference 6.6x faster than the same prior art, while using 5.9x fewer parameters. FC-CLIP also sets a new state of the art across various open-vocabulary semantic segmentation datasets. Code: https://github.com/bytedance/fc-clip
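The efficiency argument in the abstract can be illustrated with a minimal sketch: in a two-stage pipeline the image is encoded once by the mask generator's backbone and then re-encoded by CLIP for each predicted mask, while FC-CLIP's single-stage design runs one shared frozen backbone pass whose features serve both mask generation and classification. All function and variable names below are illustrative, not the paper's actual API; the sketch only counts backbone invocations.

```python
# Hypothetical sketch contrasting a two-stage open-vocabulary pipeline
# with a single-stage, shared-backbone design (names are illustrative).

def two_stage(image, num_masks, backbone_calls):
    """Two-stage: separate backbones for mask generation and CLIP scoring."""
    # Stage 1: the mask generator runs its own backbone pass over the image.
    backbone_calls.append("mask_generator_backbone")
    masks = [f"mask_{i}" for i in range(num_masks)]
    # Stage 2: each predicted mask region is re-encoded by the CLIP backbone.
    for m in masks:
        backbone_calls.append(f"clip_backbone({m})")
    return masks

def single_stage(image, num_masks, backbone_calls):
    """Single-stage: one frozen convolutional CLIP pass, features shared."""
    backbone_calls.append("frozen_conv_clip_backbone")
    shared_features = f"features({image})"
    # Both the mask decoder and the open-vocabulary classifier reuse
    # shared_features; no further backbone passes are needed.
    masks = [f"mask_{i}_from_{shared_features}" for i in range(num_masks)]
    return masks

calls_a, calls_b = [], []
two_stage("img", 100, calls_a)     # 1 generator pass + 100 CLIP passes
single_stage("img", 100, calls_b)  # 1 shared pass, regardless of mask count
print(len(calls_a), len(calls_b))  # 101 1
```

The gap grows linearly with the number of predicted masks, which is one intuition behind the reported 6.6x inference speedup.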

Results

Task                                   | Dataset            | Metric | Value | Model
---------------------------------------|--------------------|--------|-------|--------
Open Vocabulary Panoptic Segmentation  | ADE20K             | PQ     | 26.8  | FC-CLIP
Open Vocabulary Semantic Segmentation  | ADE20K-847         | mIoU   | 14.8  | FC-CLIP
Open Vocabulary Semantic Segmentation  | Cityscapes         | mIoU   | 56.2  | FC-CLIP
Open Vocabulary Semantic Segmentation  | PascalVOC-20b      | mIoU   | 81.8  | FC-CLIP
Open Vocabulary Semantic Segmentation  | PASCAL Context-459 | mIoU   | 18.2  | FC-CLIP
Open Vocabulary Semantic Segmentation  | PascalVOC-20       | mIoU   | 95.4  | FC-CLIP
Open Vocabulary Semantic Segmentation  | PASCAL Context-59  | mIoU   | 58.4  | FC-CLIP
Open Vocabulary Semantic Segmentation  | ADE20K-150         | mIoU   | 34.1  | FC-CLIP

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation (2025-07-15)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)