Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling

Leon Sick, Dominik Engel, Pedro Hermosilla, Timo Ropinski

2023-09-21 | CVPR 2024
Tasks: Unsupervised Semantic Segmentation, Semantic Segmentation, Unsupervised Panoptic Segmentation

Abstract

Traditionally, training neural networks to perform semantic segmentation has required expensive human-made annotations. More recently, advances in unsupervised learning have made significant progress on this issue and towards closing the gap to supervised algorithms. To achieve this, semantic knowledge is distilled by learning to correlate randomly sampled features from images across an entire dataset. In this work, we build upon these advances by incorporating information about the structure of the scene into the training process through the use of depth information. We achieve this by (1) learning depth-feature correlation, spatially correlating the feature maps with the depth maps to induce knowledge about the structure of the scene, and (2) implementing farthest-point sampling to select relevant features more effectively by applying 3D sampling techniques to the depth information of the scene. Finally, we demonstrate the effectiveness of our technical contributions through extensive experimentation and present significant improvements in performance across multiple benchmark datasets.
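To make contribution (2) concrete, here is a minimal sketch of farthest-point sampling on a depth map. It is not the authors' implementation: the helper names `backproject` and `farthest_point_sampling`, the pinhole camera model, and the `focal` parameter are all assumptions made for illustration. The idea is to lift each pixel to a 3D point using its depth and then greedily pick locations that are far apart in 3D, so that the sampled feature locations cover the scene structure more evenly than uniform random sampling would.

```python
import math


def backproject(depth, focal=1.0):
    """Lift an H x W depth map (list of lists) to one 3D point per pixel.

    Assumption: pinhole camera with the principal point at the image
    center and a hypothetical focal length `focal`.
    """
    h, w = len(depth), len(depth[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    points = []
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            points.append(((u - cx) * z / focal, (v - cy) * z / focal, z))
    return points


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly add the point farthest from the selected set.

    Returns the indices of the k selected points, in selection order.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be in [1, len(points)]")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy 2 x 2 depth map: three near pixels and one far pixel.
depth = [[1.0, 1.0],
         [1.0, 10.0]]
indices = farthest_point_sampling(backproject(depth), k=3)
```

The returned indices are flat pixel positions (row-major), which could then be used to gather the corresponding entries of a feature map. This greedy variant costs O(k·N) distance evaluations for N pixels, which is usually acceptable at feature-map resolution.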

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	COCO-Stuff-27	Clustering [Accuracy]	58.6	DepthG (ViT-B/8)
Semantic Segmentation	COCO-Stuff-27	Clustering [mIoU]	29	DepthG (ViT-B/8)
Semantic Segmentation	COCO-Stuff-27	Linear Classifier [Accuracy]	75.5	DepthG (ViT-B/8)
Semantic Segmentation	COCO-Stuff-27	Linear Classifier [mIoU]	41.6	DepthG (ViT-B/8)
Semantic Segmentation	COCO-Stuff-27	Clustering [Accuracy]	55.1	DepthG w/ 3D-LHP (ViT-S/8)
Semantic Segmentation	COCO-Stuff-27	Clustering [mIoU]	26.7	DepthG w/ 3D-LHP (ViT-S/8)
Semantic Segmentation	COCO-Stuff-27	Linear Classifier [Accuracy]	73.9	DepthG w/ 3D-LHP (ViT-S/8)
Semantic Segmentation	COCO-Stuff-27	Linear Classifier [mIoU]	37.8	DepthG w/ 3D-LHP (ViT-S/8)
Semantic Segmentation	COCO-Stuff-27	Clustering [Accuracy]	56.3	DepthG (ViT-S/8)
Semantic Segmentation	COCO-Stuff-27	Clustering [mIoU]	25.6	DepthG (ViT-S/8)
Semantic Segmentation	COCO-Stuff-27	Linear Classifier [Accuracy]	73.7	DepthG (ViT-S/8)
Semantic Segmentation	COCO-Stuff-27	Linear Classifier [mIoU]	38.9	DepthG (ViT-S/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Clustering [Accuracy]	58.6	DepthG (ViT-B/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Clustering [mIoU]	29	DepthG (ViT-B/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Linear Classifier [Accuracy]	75.5	DepthG (ViT-B/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Linear Classifier [mIoU]	41.6	DepthG (ViT-B/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Clustering [Accuracy]	55.1	DepthG w/ 3D-LHP (ViT-S/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Clustering [mIoU]	26.7	DepthG w/ 3D-LHP (ViT-S/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Linear Classifier [Accuracy]	73.9	DepthG w/ 3D-LHP (ViT-S/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Linear Classifier [mIoU]	37.8	DepthG w/ 3D-LHP (ViT-S/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Clustering [Accuracy]	56.3	DepthG (ViT-S/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Clustering [mIoU]	25.6	DepthG (ViT-S/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Linear Classifier [Accuracy]	73.7	DepthG (ViT-S/8)
Unsupervised Semantic Segmentation	COCO-Stuff-27	Linear Classifier [mIoU]	38.9	DepthG (ViT-S/8)
Unsupervised Panoptic Segmentation	Cityscapes	PQ	16.1	DepthG + CutLER
2D Panoptic Segmentation	Cityscapes	PQ	16.1	DepthG + CutLER
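
For readers unfamiliar with the Accuracy and mIoU columns, the following sketch shows how both are typically derived from a confusion matrix over pixel labels. It is an illustration under assumptions, not the evaluation code behind the numbers above: the function names are hypothetical, classes absent from both prediction and ground truth are skipped when averaging (a convention that varies between codebases), and the step that matches predicted cluster IDs to ground-truth classes in the clustering protocol (commonly done with Hungarian matching) is omitted.

```python
def confusion_matrix(pred, target, num_classes):
    """cm[t][p] counts pixels with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class equals the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """Mean over classes of TP / (TP + FP + FN).

    Assumption: classes that appear in neither prediction nor ground
    truth are skipped instead of counted as zero.
    """
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                 # class-c pixels that were missed
        fp = sum(row[c] for row in cm) - tp  # other pixels predicted as c
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Four pixels, two classes: one class-1 pixel is mispredicted as class 0.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
cm = confusion_matrix(pred, target, num_classes=2)
acc = pixel_accuracy(cm)   # 3 of 4 pixels correct
miou = mean_iou(cm)        # mean of 1/2 (class 0) and 2/3 (class 1)
```

The table reports both metrics as percentages, so values computed this way would be multiplied by 100.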
