Junbum Cha, Jonghwan Mun, Byungseok Roh
We tackle open-world semantic segmentation, which aims to learn to segment arbitrary visual concepts in images using only image-text pairs, without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy: they consider only image-text alignment during training, whereas segmentation requires region-text alignment at test time. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to learn region-text alignment directly. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages the model to improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol covering eight widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
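To make the three steps in the abstract concrete (generate a mask for a text, pool an image embedding from the masked region, align it with the text embedding contrastively), here is a minimal pure-Python sketch. It is not the authors' implementation: the sigmoid-of-cosine mask, the mask-weighted average pooling, the symmetric InfoNCE loss, the temperature of 0.07, and all function names (`soft_mask`, `masked_pool`, `info_nce`, `text_grounded_loss`) are assumptions chosen for illustration. The released code at the URL above is the reference.

```python
import math

def normalize(v):
    # L2-normalize a vector; fall back to 1.0 to avoid dividing by zero.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_mask(patch_feats, text_emb):
    # Assumed mask generator: sigmoid of patch-text cosine similarity.
    t = normalize(text_emb)
    return [1.0 / (1.0 + math.exp(-dot(normalize(p), t))) for p in patch_feats]

def masked_pool(patch_feats, mask):
    # Mask-weighted average of patch features, standing in for the
    # "text-grounded image embedding" extracted from the masked region.
    total = sum(mask) or 1.0
    dim = len(patch_feats[0])
    return [sum(m * p[d] for m, p in zip(mask, patch_feats)) / total
            for d in range(dim)]

def info_nce(image_embs, text_embs, temperature=0.07):
    # Symmetric InfoNCE: the i-th image embedding and i-th text are positives.
    img = [normalize(v) for v in image_embs]
    txt = [normalize(v) for v in text_embs]
    logits = [[dot(i, t) / temperature for t in txt] for i in img]
    n = len(logits)

    def cross_entropy(rows):
        loss = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            loss += lse - row[k]
        return loss / n

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))

def text_grounded_loss(batch_patch_feats, batch_text_embs):
    # For each image-text pair: mask from the paired text, then pooled embedding.
    grounded = [masked_pool(p, soft_mask(p, t))
                for p, t in zip(batch_patch_feats, batch_text_embs)]
    return info_nce(grounded, batch_text_embs)
```

Because the pooled embedding depends on the mask, lowering this loss pushes the mask toward patches that match the text, which is the region-text alignment the abstract describes. In this toy version the patch features are plain lists of floats; a real model would obtain them from an image encoder and backpropagate through the mask.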
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | CC3M-TagMask | mIoU | 60.4 | TCL |
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 22.4 | TCL |
| Semantic Segmentation | COCO-Object | mIoU | 31.6 | TCL |
| Semantic Segmentation | ADE20K | Mean IoU (val) | 17.1 | TCL |
| Semantic Segmentation | Cityscapes val | mIoU | 24 | TCL |
| Semantic Segmentation | PASCAL Context-59 | mIoU | 33.9 | TCL |
| Semantic Segmentation | PascalVOC-20 | mIoU | 83.2 | TCL |
| Semantic Segmentation | PASCAL VOC | mIoU | 55 | TCL |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 83.2 | TCL |
| Unsupervised Semantic Segmentation | COCO-Stuff-171 | mIoU | 22.4 | TCL |
| Unsupervised Semantic Segmentation | COCO-Object | mIoU | 31.6 | TCL |
| Unsupervised Semantic Segmentation | ADE20K | Mean IoU (val) | 17.1 | TCL |
| Unsupervised Semantic Segmentation | Cityscapes val | mIoU | 24 | TCL |
| Unsupervised Semantic Segmentation | PASCAL Context-59 | mIoU | 33.9 | TCL |
| Unsupervised Semantic Segmentation | PascalVOC-20 | mIoU | 83.2 | TCL |
| Unsupervised Semantic Segmentation | PASCAL VOC | mIoU | 55 | TCL |
| Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 83.2 | TCL |
| Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 83.2 | TCL |
| Open Vocabulary Semantic Segmentation | PascalVOC-20 | mIoU | 83.2 | TCL |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | mIoU | 33.9 | TCL |
| 10-shot image generation | CC3M-TagMask | mIoU | 60.4 | TCL |
| 10-shot image generation | COCO-Stuff-171 | mIoU | 22.4 | TCL |
| 10-shot image generation | COCO-Object | mIoU | 31.6 | TCL |
| 10-shot image generation | ADE20K | Mean IoU (val) | 17.1 | TCL |
| 10-shot image generation | Cityscapes val | mIoU | 24 | TCL |
| 10-shot image generation | PASCAL Context-59 | mIoU | 33.9 | TCL |
| 10-shot image generation | PascalVOC-20 | mIoU | 83.2 | TCL |
| 10-shot image generation | PASCAL VOC | mIoU | 55 | TCL |
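
The mIoU values in the table are mean intersection-over-union scores, reported as percentages. As a reminder of how the metric is usually computed, here is a small sketch over flattened label maps. The function name `mean_iou`, the `ignore_index=255` convention, and the choice to accumulate counts over all supplied pixels are assumptions; the exact protocol for each dataset is defined by the paper's unified evaluation setup.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    # pred and target are flat sequences of integer class ids, one per pixel.
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not count toward any class
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[t] += 1  # pixel belongs to the target set of class t
            if 0 <= p < num_classes:
                union[p] += 1  # and to the predicted set of class p
    # Average only over classes that appear in the prediction or the target.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")
```

For example, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)` gives an IoU of 1/2 for class 0 and 2/3 for class 1, so the mean is 7/12; multiplying by 100 gives the percentage form used in the table.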