Open-Vocabulary Universal Image Segmentation with MaskCLIP

Zheng Ding, Jieke Wang, Zhuowen Tu

2022-08-18Panoptic Segmentation Open Vocabulary Semantic Segmentation Segmentation Semantic Segmentation Open Vocabulary Panoptic Segmentation Instance Segmentation Image Segmentation

Paper PDF Code(official)

Abstract

In this paper, we tackle an emerging computer vision task, open-vocabulary universal image segmentation, that aims to perform semantic/instance/panoptic segmentation (background semantic labeling + foreground instance segmentation) for arbitrary categories of text-based descriptions in inference time. We first build a baseline method by directly adopting pre-trained CLIP models without finetuning or distillation. We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder, which is an encoder-only module that seamlessly integrates mask tokens with a pre-trained ViT CLIP model for semantic/instance segmentation and class prediction. MaskCLIP learns to efficiently and effectively utilize pre-trained partial/dense CLIP features within the MaskCLIP Visual Encoder that avoids the time-consuming student-teacher training process. MaskCLIP outperforms previous methods for semantic/instance/panoptic segmentation on ADE20K and PASCAL datasets. We show qualitative illustrations for MaskCLIP with online custom categories. Project website: https://maskclip.github.io.

Results

Task	Dataset	Metric	Value	Model
Open Vocabulary Semantic Segmentation	ADE20K-847	mIoU	8.2	MaskCLIP
Open Vocabulary Semantic Segmentation	PASCAL Context-59	mIoU	45.9	MaskCLIP
Open Vocabulary Semantic Segmentation	ADE20K-150	mIoU	23.7	MaskCLIP

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction2025-07-21 Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction2025-07-17 DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model2025-07-17 From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation2025-07-17 Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion2025-07-17 SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation2025-07-17 Unified Medical Image Segmentation with State Space Modeling Snake2025-07-17 A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique2025-07-17