Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Perceptual Grouping in Contrastive Vision-Language Models

Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi, Yinfei Yang, Alexander Toshev, Jonathon Shlens

Published: 2022-10-18 · ICCV 2023
Tasks: Unsupervised Semantic Segmentation with Language-Image Pre-training · Representation Learning · Unsupervised Semantic Segmentation · Object Localization
Links: Paper · PDF · Code

Abstract

Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.
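The abstract describes probing per-patch visual representations with natural-language phrases to recover where objects reside. A minimal sketch of that idea is to embed each image patch and each class name into a shared space, then label every patch with its nearest text embedding. The function below is illustrative only (shapes and names are assumptions, not the paper's exact CLIPpy pipeline, which additionally modifies the training recipe):

```python
import numpy as np

def zero_shot_segment(patch_feats, text_feats):
    """Label each image patch with its most similar text embedding.

    patch_feats: (H, W, D) array of per-patch image embeddings.
    text_feats:  (C, D) array of class-name text embeddings.
    Returns an (H, W) integer map of class indices.
    """
    # L2-normalize so the dot product equals cosine similarity.
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    sim = p @ t.T                # (H, W, C) cosine similarities
    return sim.argmax(axis=-1)   # per-patch class index
```

Upsampling the resulting low-resolution label map to the image size gives the kind of bottom-up segmentation the paper evaluates.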

Results

| Task                              | Dataset                          | Metric         | Value | Model        |
|-----------------------------------|----------------------------------|----------------|-------|--------------|
| Semantic Segmentation             | COCO (Common Objects in Context) | Mean IoU (val) | 25.5  | CLIPpy ViT-B |
| Semantic Segmentation             | PASCAL VOC 2007                  | Mean IoU (val) | 52.2  | CLIPpy ViT-B |
| Semantic Segmentation             | ADE20K                           | Mean IoU (val) | 13.5  | CLIPpy ViT-B |
| Semantic Segmentation             | Cityscapes val                   | mIoU           | 18.1  | CLIPpy ViT-B |
| Unsupervised Semantic Segmentation| COCO (Common Objects in Context) | Mean IoU (val) | 25.5  | CLIPpy ViT-B |
| Unsupervised Semantic Segmentation| PASCAL VOC 2007                  | Mean IoU (val) | 52.2  | CLIPpy ViT-B |
| Unsupervised Semantic Segmentation| ADE20K                           | Mean IoU (val) | 13.5  | CLIPpy ViT-B |
| Unsupervised Semantic Segmentation| Cityscapes val                   | mIoU           | 18.1  | CLIPpy ViT-B |
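All results above are reported as Mean IoU (mIoU): per-class intersection-over-union between predicted and ground-truth masks, averaged over classes. A straightforward reference computation (a standard definition, not code from the paper):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes that appear in pred or gt.

    pred, gt: integer label arrays of the same shape.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

Benchmark implementations typically accumulate the intersection and union counts over the whole validation set before dividing, rather than averaging per-image scores.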

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
- Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction (2025-07-15)
- Dual Dimensions Geometric Representation Learning Based Document Dewarping (2025-07-11)