Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Perceptual Grouping in Contrastive Vision-Language Models

Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi, Yinfei Yang, Alexander Toshev, Jonathon Shlens

Published: 2022-10-18 · ICCV 2023
Tasks: Unsupervised Semantic Segmentation with Language-Image Pre-training · Representation Learning · Unsupervised Semantic Segmentation · Object Localization
Links: Paper · PDF · Code

Abstract

Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.
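The abstract describes probing per-patch visual representations with natural-language phrases to recover where objects reside. A minimal sketch of that idea is to embed each image patch and each class name into a shared space, then label every patch with its nearest text embedding. The function below is illustrative only (shapes and names are assumptions, not the paper's exact CLIPpy pipeline, which additionally modifies the training recipe):

```python
import numpy as np

def zero_shot_segment(patch_feats, text_feats):
    """Label each image patch with its most similar text embedding.

    patch_feats: (H, W, D) array of per-patch image embeddings.
    text_feats:  (C, D) array of class-name text embeddings.
    Returns an (H, W) integer map of class indices.
    """
    # L2-normalize so the dot product equals cosine similarity.
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    sim = p @ t.T                # (H, W, C) cosine similarities
    return sim.argmax(axis=-1)   # per-patch class index
```

Upsampling the resulting low-resolution label map to the image size gives the kind of bottom-up segmentation the paper evaluates.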

Results

| Task                              | Dataset                          | Metric         | Value | Model        |
|-----------------------------------|----------------------------------|----------------|-------|--------------|
| Semantic Segmentation             | COCO (Common Objects in Context) | Mean IoU (val) | 25.5  | CLIPpy ViT-B |
| Semantic Segmentation             | PASCAL VOC 2007                  | Mean IoU (val) | 52.2  | CLIPpy ViT-B |
| Semantic Segmentation             | ADE20K                           | Mean IoU (val) | 13.5  | CLIPpy ViT-B |
| Semantic Segmentation             | Cityscapes val                   | mIoU           | 18.1  | CLIPpy ViT-B |
| Unsupervised Semantic Segmentation| COCO (Common Objects in Context) | Mean IoU (val) | 25.5  | CLIPpy ViT-B |
| Unsupervised Semantic Segmentation| PASCAL VOC 2007                  | Mean IoU (val) | 52.2  | CLIPpy ViT-B |
| Unsupervised Semantic Segmentation| ADE20K                           | Mean IoU (val) | 13.5  | CLIPpy ViT-B |
| Unsupervised Semantic Segmentation| Cityscapes val                   | mIoU           | 18.1  | CLIPpy ViT-B |
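All results above are reported as Mean IoU (mIoU): per-class intersection-over-union between predicted and ground-truth masks, averaged over classes. A straightforward reference computation (a standard definition, not code from the paper):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes that appear in pred or gt.

    pred, gt: integer label arrays of the same shape.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

Benchmark implementations typically accumulate the intersection and union counts over the whole validation set before dividing, rather than averaging per-image scores.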

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
- Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction (2025-07-15)
- Dual Dimensions Geometric Representation Learning Based Document Dewarping (2025-07-11)