
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


A Simple Framework for Open-Vocabulary Segmentation and Detection

Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, Lei Zhang

Date: 2023-03-14
Venue: ICCV 2023
Tasks: Panoptic Segmentation, Zero Shot Segmentation, Segmentation, Semantic Segmentation, Instance Segmentation

Abstract

We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. To bridge the gap in vocabulary and annotation granularity, we first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them. This gives us reasonably good results compared with counterparts trained on the segmentation task only. To further reconcile the two tasks, we identify two discrepancies: (i) task discrepancy -- segmentation requires extracting masks for both foreground objects and background stuff, while detection only cares about the former; (ii) data discrepancy -- box and mask annotations have different spatial granularity, and thus are not directly interchangeable. To address these issues, we propose decoupled decoding to reduce the interference between foreground and background, and conditioned mask decoding to assist in generating masks for given boxes. To this end, we develop a simple encoder-decoder model encompassing all three techniques and train it jointly on COCO and Objects365. After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection. Specifically, OpenSeeD beats the state-of-the-art method for open-vocabulary instance and panoptic segmentation across 5 datasets, and outperforms previous work for open-vocabulary detection on LVIS and ODinW under similar settings. When transferred to specific tasks, our model achieves new SoTA for panoptic segmentation on COCO and ADE20K, and for instance segmentation on ADE20K and Cityscapes. Finally, we note that OpenSeeD is the first to explore the potential of joint training on segmentation and detection, and we hope it can serve as a strong baseline for developing a single model for both tasks in the open world.
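
The "common semantic space" idea in the abstract can be illustrated with a small sketch. This is not OpenSeeD's code: it assumes a visual embedding and per-concept text embeddings have already been computed, and it uses cosine similarity as the scoring rule. The function names and the toy three-dimensional vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify_open_vocab(visual_embedding, concept_embeddings):
    """Pick the concept whose text embedding is closest to the visual embedding.

    concept_embeddings maps a concept name to its (hypothetical) text embedding.
    The vocabulary can change at inference time because only this dict changes.
    """
    scores = {name: cosine(visual_embedding, emb)
              for name, emb in concept_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings, purely illustrative (real text encoders produce much larger vectors).
concepts = {
    "dog": [1.0, 0.0, 0.0],
    "sky": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],
}
label, scores = classify_open_vocab([0.1, 0.2, 0.9], concepts)
print(label)
```

Because the class list enters only through the dictionary of text embeddings, the same scoring rule applies to concepts from either a segmentation or a detection dataset, which is the property the abstract relies on.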

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | ADE20K val | PQ | 53.7 | OpenSeeD (SwinL, single-scale, 1280x1280)
Semantic Segmentation | COCO minival | AP | 53.2 | OpenSeeD (SwinL, single-scale)
Semantic Segmentation | COCO minival | PQ | 59.5 | OpenSeeD (SwinL, single-scale)
Instance Segmentation | Cityscapes val | mask AP | 48.5 | OpenSeeD (SwinL, single-scale)
Instance Segmentation | ADE20K val | AP | 42.6 | OpenSeeD
Zero Shot Segmentation | Segmentation in the Wild | Mean AP | 36.1 | OpenSeeD
10-shot image generation | ADE20K val | PQ | 53.7 | OpenSeeD (SwinL, single-scale, 1280x1280)
10-shot image generation | COCO minival | AP | 53.2 | OpenSeeD (SwinL, single-scale)
10-shot image generation | COCO minival | PQ | 59.5 | OpenSeeD (SwinL, single-scale)
Panoptic Segmentation | ADE20K val | PQ | 53.7 | OpenSeeD (SwinL, single-scale, 1280x1280)
Panoptic Segmentation | COCO minival | AP | 53.2 | OpenSeeD (SwinL, single-scale)
Panoptic Segmentation | COCO minival | PQ | 59.5 | OpenSeeD (SwinL, single-scale)
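
For readers unfamiliar with the PQ metric reported above, the following is a minimal sketch of the standard panoptic quality formula for a single class: the sum of IoUs over matched segments divided by TP + 0.5 FP + 0.5 FN. It assumes segment matching (IoU above 0.5) has already been done; the function name and the example numbers are hypothetical and unrelated to the results in the table. Benchmarks typically report PQ multiplied by 100.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of each matched predicted/ground-truth segment pair
                  (true positives; matching is assumed to be done already).
    num_fp: number of unmatched predicted segments.
    num_fn: number of unmatched ground-truth segments.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0  # segmentation quality: mean IoU of matches
    rq = tp / denom                             # recognition quality: F1-style term
    return sq * rq, sq, rq

# Hypothetical example: three matches, one false positive, one false negative.
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.6], num_fp=1, num_fn=1)
print(round(pq, 3), round(sq, 3), round(rq, 3))
```

The decomposition PQ = SQ x RQ separates how well matched segments overlap from how many segments are found at all, which is why a model can improve PQ through either better masks or better detection.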

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)