Junho Kim, Byung-Kwan Lee, Yong Man Ro
Unsupervised semantic segmentation aims to achieve high-quality semantic grouping without human-labeled annotations. With the advent of self-supervised pre-training, various frameworks utilize the pre-trained features to train prediction heads for unsupervised dense prediction. However, a significant challenge in this unsupervised setup is determining the appropriate level of clustering required for segmenting concepts. To address this, we propose a novel framework, CAusal Unsupervised Semantic sEgmentation (CAUSE), which leverages insights from causal inference. Specifically, we draw on an intervention-oriented approach (i.e., frontdoor adjustment) to define suitable two-step tasks for unsupervised prediction. The first step involves constructing a concept clusterbook as a mediator, which represents possible concept prototypes at different levels of granularity in a discretized form. Then, the mediator establishes an explicit link to the subsequent concept-wise self-supervised learning for pixel-level grouping. Through extensive experiments and analyses on various datasets, we corroborate the effectiveness of CAUSE and achieve state-of-the-art performance in unsupervised semantic segmentation.
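
For reference, the frontdoor adjustment mentioned in the abstract has a standard textbook form in causal inference. The symbols below are generic and are an assumption for illustration, not the paper's own notation: $X$ is the treatment, $M$ the mediator, and $Y$ the outcome. Reading the abstract, one plausible mapping is that $X$ corresponds to the pre-trained features, $M$ to the concept clusterbook, and $Y$ to the pixel-level grouping.

$$
P\bigl(Y \mid do(X = x)\bigr) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P\bigl(Y \mid x', m\bigr)\, P(x')
$$

The two nested sums mirror the two-step structure described above: the first factor, $P(m \mid x)$, concerns how the mediator is obtained from the input, and the inner sum concerns how the outcome depends on the mediator once the input is averaged out.
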
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | COCO-Stuff-81 | Pixel Accuracy | 75.2 | CAUSE-TR (ViT-S/8) |
| Semantic Segmentation | COCO-Stuff-81 | mIoU | 21.2 | CAUSE-TR (ViT-S/8) |
| Semantic Segmentation | COCO-Stuff-81 | Pixel Accuracy | 78.8 | CAUSE-MLP (ViT-S/8) |
| Semantic Segmentation | COCO-Stuff-81 | mIoU | 19.1 | CAUSE-MLP (ViT-S/8) |
| Semantic Segmentation | PASCAL VOC 2012 val | Clustering [mIoU] | 53.4 | CAUSE (iBOT, ViT-B/16) |
| Semantic Segmentation | PASCAL VOC 2012 val | Clustering [mIoU] | 53.3 | CAUSE (ViT-B/8) |
| Semantic Segmentation | PASCAL VOC 2012 val | Clustering [mIoU] | 53.2 | CAUSE (DINOv2, ViT-B/14) |
| Semantic Segmentation | COCO-Stuff-171 | Pixel Accuracy | 46.6 | CAUSE-TR (ViT-S/8) |
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 15.2 | CAUSE-TR (ViT-S/8) |
| Semantic Segmentation | COCO-Stuff-27 | Clustering [Accuracy] | 78 | CAUSE (DINOv2, ViT-B/14) |
| Semantic Segmentation | COCO-Stuff-27 | Clustering [mIoU] | 45.3 | CAUSE (DINOv2, ViT-B/14) |
| Semantic Segmentation | COCO-Stuff-27 | Clustering [Accuracy] | 74.9 | CAUSE (ViT-B/8) |
| Semantic Segmentation | COCO-Stuff-27 | Clustering [mIoU] | 41.9 | CAUSE (ViT-B/8) |
| Unsupervised Semantic Segmentation | COCO-Stuff-81 | Pixel Accuracy | 75.2 | CAUSE-TR (ViT-S/8) |
| Unsupervised Semantic Segmentation | COCO-Stuff-81 | mIoU | 21.2 | CAUSE-TR (ViT-S/8) |
| Unsupervised Semantic Segmentation | COCO-Stuff-81 | Pixel Accuracy | 78.8 | CAUSE-MLP (ViT-S/8) |
| Unsupervised Semantic Segmentation | COCO-Stuff-81 | mIoU | 19.1 | CAUSE-MLP (ViT-S/8) |
| Unsupervised Semantic Segmentation | PASCAL VOC 2012 val | Clustering [mIoU] | 53.4 | CAUSE (iBOT, ViT-B/16) |
| Unsupervised Semantic Segmentation | PASCAL VOC 2012 val | Clustering [mIoU] | 53.3 | CAUSE (ViT-B/8) |
| Unsupervised Semantic Segmentation | PASCAL VOC 2012 val | Clustering [mIoU] | 53.2 | CAUSE (DINOv2, ViT-B/14) |
| Unsupervised Semantic Segmentation | COCO-Stuff-171 | Pixel Accuracy | 46.6 | CAUSE-TR (ViT-S/8) |
| Unsupervised Semantic Segmentation | COCO-Stuff-171 | mIoU | 15.2 | CAUSE-TR (ViT-S/8) |
| Unsupervised Semantic Segmentation | COCO-Stuff-27 | Clustering [Accuracy] | 78 | CAUSE (DINOv2, ViT-B/14) |
| Unsupervised Semantic Segmentation | COCO-Stuff-27 | Clustering [mIoU] | 45.3 | CAUSE (DINOv2, ViT-B/14) |
| Unsupervised Semantic Segmentation | COCO-Stuff-27 | Clustering [Accuracy] | 74.9 | CAUSE (ViT-B/8) |
| Unsupervised Semantic Segmentation | COCO-Stuff-27 | Clustering [mIoU] | 41.9 | CAUSE (ViT-B/8) |
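
The table above reports Pixel Accuracy and mIoU. As a minimal sketch of how these two metrics are typically computed from a confusion matrix, the following pure-Python example may help. It assumes that predicted cluster IDs have already been matched to ground-truth class IDs (the matching step used for unsupervised evaluation is not shown), and the function names `confusion_matrix`, `pixel_accuracy`, and `mean_iou` are hypothetical helpers, not part of the CAUSE codebase.

```python
from typing import List, Sequence


def confusion_matrix(labels: Sequence[int], preds: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels; rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for y, p in zip(labels, preds):
        cm[y][p] += 1
    return cm


def pixel_accuracy(cm: List[List[int]]) -> float:
    """Fraction of pixels whose predicted class equals the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total if total else 0.0


def mean_iou(cm: List[List[int]]) -> float:
    """Average of per-class intersection-over-union values."""
    ious = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                  # class-k pixels predicted as other classes
        fp = sum(row[k] for row in cm) - tp   # other pixels predicted as class k
        denom = tp + fp + fn
        if denom:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six flattened pixels, three classes (assumed already matched).
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(labels, preds, num_classes=3)
print(f"pixel accuracy = {pixel_accuracy(cm):.3f}, mIoU = {mean_iou(cm):.3f}")
```

On this toy input, four of six pixels are correct, and the per-class IoUs are 1/3, 2/3, and 1/2. The gap between the two metrics illustrates why a model can score high on Pixel Accuracy while scoring much lower on mIoU: mIoU weights every class equally, regardless of how many pixels it covers.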