Jiyoung Kim, Kyuhong Shim, Insu Lee, Byonghyo Shim
Unsupervised semantic segmentation (USS) aims to discover and recognize meaningful categories without any labels. Successful USS requires two key abilities: (1) information compression and (2) clustering capability. Previous methods have relied on feature dimension reduction for information compression; however, this approach may hinder clustering. In this paper, we propose a novel USS framework called Expand-and-Quantize Unsupervised Semantic Segmentation (EQUSS), which combines the benefits of high-dimensional spaces for better clustering with product quantization for effective information compression. Our extensive experiments demonstrate that EQUSS achieves state-of-the-art results on three standard benchmarks. In addition, we analyze the entropy of USS features, a first step towards understanding USS from the perspective of information theory.
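
To make the expand-and-quantize idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a plain linear projection for the expansion step, fixed random codebooks for product quantization, and the empirical entropy of code usage as one possible information measure. The names `expand`, `product_quantize`, and `code_entropy_bits`, and all shapes, are hypothetical and chosen only for illustration.

```python
import numpy as np


def expand(features, weight):
    """Project features into a higher-dimensional space (assumed: a linear map)."""
    return features @ weight


def product_quantize(features, codebooks):
    """Quantize each sub-vector to its nearest codeword.

    features:  (n, d) array, where d = m * sub_dim
    codebooks: (m, k, sub_dim) array, k codewords for each of the m subspaces
    returns:   codes of shape (n, m) and the quantized features of shape (n, d)
    """
    n, d = features.shape
    m, k, sub_dim = codebooks.shape
    assert d == m * sub_dim, "feature dimension must split evenly across codebooks"
    chunks = features.reshape(n, m, sub_dim)            # split into m sub-vectors
    diffs = chunks[:, :, None, :] - codebooks[None]     # (n, m, k, sub_dim)
    dists = (diffs ** 2).sum(axis=-1)                   # squared distances, (n, m, k)
    codes = dists.argmin(axis=-1)                       # nearest codeword per subspace
    quantized = codebooks[np.arange(m)[None, :], codes]  # (n, m, sub_dim)
    return codes, quantized.reshape(n, d)


def code_entropy_bits(codes, k):
    """Empirical entropy (in bits) of codeword usage, one value per codebook."""
    entropies = np.zeros(codes.shape[1])
    for j in range(codes.shape[1]):
        counts = np.bincount(codes[:, j], minlength=k)
        p = counts[counts > 0] / counts.sum()
        entropies[j] = -(p * np.log2(p)).sum()
    return entropies


# Toy usage: 8 feature vectors of dimension 4, expanded to 16 dimensions,
# then quantized with m=4 codebooks of k=8 codewords each (all values random).
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))
W = rng.normal(size=(4, 16))
codebooks = rng.normal(size=(4, 8, 4))
expanded = expand(feats, W)
codes, quantized = product_quantize(expanded, codebooks)
entropy = code_entropy_bits(codes, k=8)
```

With `m` codebooks of `k` codewords each, every feature is stored as `m` small integers (at most `m * log2(k)` bits), while the quantized vector still lives in the expanded `d`-dimensional space. In a trained model the codebooks and projection would be learned rather than random.
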
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Unsupervised Semantic Segmentation | Potsdam-3 | Accuracy | 82 | EQUSS |
| Unsupervised Semantic Segmentation | Cityscapes test | Accuracy | 79.9 | EQUSS |
| Unsupervised Semantic Segmentation | Cityscapes test | mIoU | 22 | EQUSS |
| Unsupervised Semantic Segmentation | COCO-Stuff-27 | Clustering [Accuracy] | 53.8 | EQUSS |
| Unsupervised Semantic Segmentation | COCO-Stuff-27 | Clustering [mIoU] | 25.8 | EQUSS |
| Unsupervised Semantic Segmentation | COCO-Stuff-27 | Linear Classifier [Accuracy] | 75.2 | EQUSS |
| Unsupervised Semantic Segmentation | COCO-Stuff-27 | Linear Classifier [mIoU] | 41.2 | EQUSS |
| Unsupervised Semantic Segmentation | COCO-Stuff-27 | Clustering [Accuracy] | 53.8 | EQUSS (ViT-S) |
| Unsupervised Semantic Segmentation | COCO-Stuff-27 | Clustering [mIoU] | 25.8 | EQUSS (ViT-S) |
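
The table reports accuracy and mIoU under a clustering protocol. As a hedged illustration of how such metrics are commonly computed (the exact evaluation code behind the numbers above is not given here), the sketch below builds a confusion matrix, matches cluster IDs to ground-truth classes, and then computes pixel accuracy and mean IoU. The brute-force permutation search is an assumption made to keep the example dependency-free; it is only practical for a handful of classes, and a Hungarian solver is the usual choice for larger label sets. All function names are hypothetical.

```python
from itertools import permutations

import numpy as np


def confusion_matrix(pred, target, num_classes):
    """Rows index ground-truth classes, columns index predicted labels."""
    idx = target * num_classes + pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def best_cluster_mapping(conf):
    """Return mapping[c] = class assigned to cluster c, maximizing matched pixels.

    Brute force over all permutations; feasible only for small class counts.
    """
    n = conf.shape[0]
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(conf[perm[c], c] for c in range(n)),
    )
    return np.array(best)


def accuracy_and_miou(conf):
    """Pixel accuracy and mean intersection-over-union from a confusion matrix."""
    tp = np.diag(conf).astype(float)
    accuracy = tp.sum() / conf.sum()
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1)  # guard against classes with no pixels
    return accuracy, iou.mean()


# Toy example with 3 classes and 6 pixels (made-up labels).
target = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([2, 2, 0, 0, 1, 0])       # raw cluster IDs, arbitrary numbering
mapping = best_cluster_mapping(confusion_matrix(pred, target, 3))
remapped = mapping[pred]                   # cluster IDs translated to class IDs
acc, miou = accuracy_and_miou(confusion_matrix(remapped, target, 3))
```

In this toy case five of the six pixels are matched after remapping, so `acc` is 5/6, and the per-class IoUs are 1, 2/3, and 1/2. For the linear classifier protocol no matching step is needed, since the classifier predicts class IDs directly.
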