Gyungin Shin, Weidi Xie, Samuel Albanie
Semantic segmentation has a broad range of applications, but its real-world impact has been significantly limited by the prohibitive annotation costs required for deployment. Segmentation methods that forgo supervision can side-step these costs, but they inconveniently require labelled examples from the target distribution in order to assign concept names to predictions. An alternative line of work in language-image pre-training has recently demonstrated the potential to produce models that can both assign names across large vocabularies of concepts and enable zero-shot transfer for classification, but these models do not demonstrate commensurate segmentation abilities. In this work, we strive for a synthesis of these two approaches that combines their strengths. We leverage the retrieval abilities of one such language-image pre-trained model, CLIP, to dynamically curate training sets from unlabelled images for arbitrary collections of concept names, and exploit the robust correspondences offered by modern image representations to co-segment entities among the resulting collections. The synthetic segment collections are then employed to construct a segmentation model (without requiring pixel labels) whose knowledge of concepts is inherited from the scalable pre-training process of CLIP. We demonstrate that our approach, termed Retrieve and Co-segment (ReCo), performs favourably in comparison to unsupervised segmentation approaches while inheriting the convenience of nameable predictions and zero-shot transfer. We also demonstrate ReCo's ability to generate specialist segmenters for extremely rare objects.
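The curation step described above (ranking unlabelled images against a concept-name text embedding) can be sketched as a simple cosine-similarity retrieval. This is a minimal illustration, not the paper's implementation: it assumes CLIP image and text embeddings have already been computed, and stands them in with random vectors; the function name `curate_training_set` is hypothetical.

```python
import numpy as np

def curate_training_set(image_embs: np.ndarray, text_emb: np.ndarray, k: int) -> np.ndarray:
    """Rank unlabelled images by cosine similarity to a concept-name text
    embedding and return the indices of the top-k matches (the 'curated' set)."""
    # L2-normalise so the dot product equals cosine similarity.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    sims = image_embs @ text_emb          # one similarity score per image
    return np.argsort(-sims)[:k]          # indices of the k highest-scoring images

# Toy example: random vectors standing in for precomputed CLIP features
# (512-dim, as in CLIP ViT-B variants).
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(100, 512))  # 100 unlabelled images
text_emb = rng.normal(size=512)           # embedding of one concept name
top10 = curate_training_set(image_embs, text_emb, k=10)
```

In the full method, the retrieved collection for each concept name is then co-segmented to produce the synthetic training segments; this sketch covers only the retrieval stage.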
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 14.8 | ReCo |
| Semantic Segmentation | COCO-Object | mIoU | 15.7 | ReCo |
| Semantic Segmentation | ADE20K val | mIoU | 11.2 | ReCo |
| Semantic Segmentation | Cityscapes val | mIoU | 24.2 | ReCo+ |
| Semantic Segmentation | Cityscapes val | pixel accuracy | 83.7 | ReCo+ |
| Semantic Segmentation | Cityscapes val | mIoU | 19.3 | ReCo |
| Semantic Segmentation | Cityscapes val | pixel accuracy | 74.6 | ReCo |
| Semantic Segmentation | PASCAL Context-59 | mIoU | 22.3 | ReCo |
| Semantic Segmentation | PASCAL VOC-20 | mIoU | 57.7 | ReCo |
| Semantic Segmentation | KITTI-STEP | mIoU | 31.9 | ReCo+ |
| Semantic Segmentation | KITTI-STEP | pixel accuracy | 75.3 | ReCo+ |
| Semantic Segmentation | KITTI-STEP | mIoU | 29.8 | ReCo |
| Semantic Segmentation | KITTI-STEP | pixel accuracy | 70.6 | ReCo |
| Semantic Segmentation | COCO-Stuff-27 | mIoU | 32.6 | ReCo+ |
| Semantic Segmentation | COCO-Stuff-27 | pixel accuracy | 54.1 | ReCo+ |
| Semantic Segmentation | COCO-Stuff-27 | mIoU | 26.3 | ReCo |
| Semantic Segmentation | COCO-Stuff-27 | pixel accuracy | 46.1 | ReCo |