Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

Junbum Cha, Jonghwan Mun, Byungseok Roh

Date: 2022-12-01
Venue: CVPR 2023
Tasks: Unsupervised Semantic Segmentation with Language-image Pre-training, Open Vocabulary Semantic Segmentation, Zero Shot Segmentation, Segmentation, Semantic Segmentation, Contrastive Learning

Abstract

We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images, by using only image-text pairs without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy, since they only consider image-text alignment during training, whereas segmentation requires region-text alignment during testing. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to directly learn region-text alignment. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages a model to directly improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with eight widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
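
The three steps named in the abstract (generate a mask for a text, pool an image embedding from the masked region, align it with the text embedding) can be illustrated with a toy sketch. This is not the authors' implementation, which is available at the repository linked above. Every name here (`soft_mask`, `masked_pool`, `info_nce`) is hypothetical, the sigmoid-of-cosine mask stands in for the paper's learned mask generator, the loss is a plain one-directional InfoNCE, and the `scale` and `temperature` values are arbitrary assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]


def soft_mask(patch_feats, text_emb, scale=10.0):
    """Toy mask: sigmoid of the scaled cosine similarity between each
    patch feature and the text embedding (stand-in for a learned generator)."""
    t = normalize(text_emb)
    return [1.0 / (1.0 + math.exp(-scale * dot(normalize(p), t)))
            for p in patch_feats]


def masked_pool(patch_feats, mask):
    """Text-grounded image embedding: mask-weighted average of patch features."""
    total = sum(mask) or 1.0
    dim = len(patch_feats[0])
    return [sum(m * p[d] for m, p in zip(mask, patch_feats)) / total
            for d in range(dim)]


def info_nce(region_embs, text_embs, temperature=0.07):
    """One-directional InfoNCE: region i should match text i and
    mismatch every other text in the batch."""
    r = [normalize(e) for e in region_embs]
    t = [normalize(e) for e in text_embs]
    loss = 0.0
    for i, ri in enumerate(r):
        logits = [dot(ri, tj) / temperature for tj in t]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(r)


# Two images with identical patches, each paired with a different text.
patches = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
regions = [masked_pool(patches, soft_mask(patches, t)) for t in texts]
loss = info_nce(regions, texts)
```

In this toy setup the pooled region embedding depends on the text that produced the mask, so the loss is low when each region is scored against its own text and high when the texts are swapped. That dependence is the region-text signal the abstract contrasts with image-level alignment.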

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | CC3M-TagMask | mIoU | 60.4 | TCL
Semantic Segmentation | COCO-Stuff-171 | mIoU | 22.4 | TCL
Semantic Segmentation | COCO-Object | mIoU | 31.6 | TCL
Semantic Segmentation | ADE20K | Mean IoU (val) | 17.1 | TCL
Semantic Segmentation | Cityscapes val | mIoU | 24 | TCL
Semantic Segmentation | PASCAL Context-59 | mIoU | 33.9 | TCL
Semantic Segmentation | PascalVOC-20 | mIoU | 83.2 | TCL
Semantic Segmentation | PASCAL VOC | mIoU | 55 | TCL
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 83.2 | TCL
Unsupervised Semantic Segmentation | COCO-Stuff-171 | mIoU | 22.4 | TCL
Unsupervised Semantic Segmentation | COCO-Object | mIoU | 31.6 | TCL
Unsupervised Semantic Segmentation | ADE20K | Mean IoU (val) | 17.1 | TCL
Unsupervised Semantic Segmentation | Cityscapes val | mIoU | 24 | TCL
Unsupervised Semantic Segmentation | PASCAL Context-59 | mIoU | 33.9 | TCL
Unsupervised Semantic Segmentation | PascalVOC-20 | mIoU | 83.2 | TCL
Unsupervised Semantic Segmentation | PASCAL VOC | mIoU | 55 | TCL
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 83.2 | TCL
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 83.2 | TCL
Open Vocabulary Semantic Segmentation | PascalVOC-20 | mIoU | 83.2 | TCL
Open Vocabulary Semantic Segmentation | PASCAL Context-59 | mIoU | 33.9 | TCL
10-shot image generation | CC3M-TagMask | mIoU | 60.4 | TCL
10-shot image generation | COCO-Stuff-171 | mIoU | 22.4 | TCL
10-shot image generation | COCO-Object | mIoU | 31.6 | TCL
10-shot image generation | ADE20K | Mean IoU (val) | 17.1 | TCL
10-shot image generation | Cityscapes val | mIoU | 24 | TCL
10-shot image generation | PASCAL Context-59 | mIoU | 33.9 | TCL
10-shot image generation | PascalVOC-20 | mIoU | 83.2 | TCL
10-shot image generation | PASCAL VOC | mIoU | 55 | TCL
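
Most rows above report mIoU (mean intersection over union). As a reminder of what that number measures, here is a minimal sketch that computes it on flattened label lists. The function name `mean_iou` and the `ignore_index=255` convention are assumptions for illustration and not part of the paper's evaluation protocol. Benchmark toolkits typically accumulate the per-class counts over the whole validation set rather than per image, and report the result as a percentage.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are flat sequences of integer class labels of equal
    length; pixels whose target equals ignore_index are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[t] += 1          # missed pixel of the true class
            if 0 <= p < num_classes:
                union[p] += 1      # false positive for the predicted class
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Classes that occur in neither the prediction nor the target are left out of the average here; other conventions exist, so scores are only comparable under the same protocol.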

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)