Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

Junbum Cha, Jonghwan Mun, Byungseok Roh

Published: 2022-12-01. Venue: CVPR 2023.
Tasks: Unsupervised Semantic Segmentation with Language-image Pre-training, Open Vocabulary Semantic Segmentation, Zero Shot Segmentation, Segmentation, Semantic Segmentation, Contrastive Learning

Abstract

We tackle open-world semantic segmentation, which aims to learn to segment arbitrary visual concepts in images using only image-text pairs, without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy: they consider only image-text alignment during training, whereas segmentation requires region-text alignment at test time. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to learn region-text alignment directly. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. Because region-text alignment is learned directly, the training objective itself encourages the model to improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol covering eight widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
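
The three steps named in the abstract (generate a mask for a text, pool the image embedding inside that mask, align the pooled embedding with the text embedding) can be illustrated with a small, dependency-free sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or the official repository: the sigmoid-of-cosine-similarity mask, the `scale` and `temperature` values, the symmetric InfoNCE form of the contrastive loss, and all function names.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # L2-normalize; fall back to 1.0 to avoid dividing by zero.
    n = math.sqrt(dot(v, v)) or 1.0
    return [a / n for a in v]

def text_grounded_mask(patches, text, scale=10.0):
    """Soft mask over patches (assumed form): sigmoid of the scaled
    cosine similarity between each patch embedding and the text."""
    t = normalize(text)
    return [1.0 / (1.0 + math.exp(-scale * dot(normalize(p), t)))
            for p in patches]

def masked_pool(patches, mask):
    """Mask-weighted average of patch embeddings, i.e. a
    text-grounded image embedding for the masked region."""
    total = sum(mask) or 1.0
    dim = len(patches[0])
    return [sum(m * p[d] for m, p in zip(mask, patches)) / total
            for d in range(dim)]

def info_nce(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE over a batch; pair k is (image k, text k)."""
    imgs = [normalize(e) for e in image_embs]
    txts = [normalize(e) for e in text_embs]
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    n = len(logits)

    def cross_entropy(rows):
        loss = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            loss += log_z - row[k]
        return loss / n

    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(cols))

def tcl_loss(batch_patches, batch_texts):
    """Mask, pool, then contrast: region-text alignment in one loss."""
    grounded = []
    for patches, text in zip(batch_patches, batch_texts):
        mask = text_grounded_mask(patches, text)
        grounded.append(masked_pool(patches, mask))
    return info_nce(grounded, batch_texts)
```

Because the mask sits inside the loss, any gradient-based training of this objective would push on the mask itself, which is the intuition the abstract gives for why learning region-text alignment directly improves mask quality. A real implementation would operate on batched tensors from image and text encoders rather than on Python lists.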

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	CC3M-TagMask	mIoU	60.4	TCL
Semantic Segmentation	COCO-Stuff-171	mIoU	22.4	TCL
Semantic Segmentation	COCO-Object	mIoU	31.6	TCL
Semantic Segmentation	ADE20K	Mean IoU (val)	17.1	TCL
Semantic Segmentation	Cityscapes val	mIoU	24	TCL
Semantic Segmentation	PASCAL Context-59	mIoU	33.9	TCL
Semantic Segmentation	PascalVOC-20	mIoU	83.2	TCL
Semantic Segmentation	PASCAL VOC	mIoU	55	TCL
Image Retrieval with Multi-Modal Query	COCO 2014	Text-to-image R@5	83.2	TCL
Unsupervised Semantic Segmentation	COCO-Stuff-171	mIoU	22.4	TCL
Unsupervised Semantic Segmentation	COCO-Object	mIoU	31.6	TCL
Unsupervised Semantic Segmentation	ADE20K	Mean IoU (val)	17.1	TCL
Unsupervised Semantic Segmentation	Cityscapes val	mIoU	24	TCL
Unsupervised Semantic Segmentation	PASCAL Context-59	mIoU	33.9	TCL
Unsupervised Semantic Segmentation	PascalVOC-20	mIoU	83.2	TCL
Unsupervised Semantic Segmentation	PASCAL VOC	mIoU	55	TCL
Cross-Modal Information Retrieval	COCO 2014	Text-to-image R@5	83.2	TCL
Cross-Modal Retrieval	COCO 2014	Text-to-image R@5	83.2	TCL
Open Vocabulary Semantic Segmentation	PascalVOC-20	mIoU	83.2	TCL
Open Vocabulary Semantic Segmentation	PASCAL Context-59	mIoU	33.9	TCL
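
For readers unfamiliar with the mIoU metric reported in the table above, the following is a minimal sketch of how mean intersection-over-union is commonly computed from predicted and ground-truth label maps. It is a generic illustration, not the evaluation code behind these numbers: the function name, the flattened-list input, and the `ignore_index=255` convention are assumptions, and benchmark protocols typically accumulate intersections and unions over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in pred or target.

    pred and target are flat sequences of integer class labels,
    one per pixel. Pixels whose target equals ignore_index are skipped.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

The function returns a fraction in [0, 1]; the table reports the same quantity as a percentage, so multiply by 100 to compare.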
