Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata

Published: 2024-12-02
Venue: CVPR 2025
Tasks: Zero-Shot Cross-Modal Retrieval, Unsupervised Semantic Segmentation with Language-image Pre-training, Self-Supervised Learning, Zero Shot Segmentation, Semantic Segmentation

Abstract

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks. Code is available at https://github.com/ExplainableML/cosmos.
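
For readers unfamiliar with the "global" contrastive loss the abstract refers to, the following is a minimal sketch of the standard CLIP-style symmetric InfoNCE objective, which compares one embedding per image with one embedding per caption. It is not the COSMOS self-distillation loss, and it is not taken from the authors' code. The function names, the temperature value of 0.07, and the pure-Python list representation are illustrative assumptions.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, CLIP-style).

    image_emb[i] and text_emb[i] are assumed to be L2-normalised
    embeddings of a matching image-caption pair; every other pairing
    in the batch is treated as a negative.
    """
    n = len(image_emb)
    # Cosine-similarity logits between every image and every caption.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed logits.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Because each image contributes a single pooled embedding, the loss gives no direct signal about which image regions match which words. This is the limitation the abstract attributes to the global contrastive objective.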

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 23.2 | COSMOS ViT-B/16 |
| Semantic Segmentation | COCO-Object | mIoU | 31.3 | COSMOS ViT-B/16 |
| Semantic Segmentation | ADE20K | Mean IoU (val) | 17.7 | COSMOS ViT-B/16 |
| Semantic Segmentation | Cityscapes val | mIoU | 34.7 | COSMOS ViT-B/16 |
| Semantic Segmentation | PASCAL Context-59 | mIoU | 33.7 | COSMOS ViT-B/16 |
| Semantic Segmentation | PascalVOC-20 | mIoU | 77.7 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 92.9 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 99.4 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.9 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 80.3 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 95.3 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 97.6 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 89.9 | COSMOS ViT-B/32 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 98.8 | COSMOS ViT-B/32 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 99.3 | COSMOS ViT-B/32 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 76.1 | COSMOS ViT-B/32 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 92.8 | COSMOS ViT-B/32 |
| Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 96.2 | COSMOS ViT-B/32 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 68 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 87.8 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 92.5 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 52.5 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 77.2 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 84.9 | COSMOS ViT-B/16 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 64.3 | COSMOS ViT-B/32 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 86.5 | COSMOS ViT-B/32 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 92 | COSMOS ViT-B/32 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 48.4 | COSMOS ViT-B/32 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 74.2 | COSMOS ViT-B/32 |
| Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 82.6 | COSMOS ViT-B/32 |
| Zero Shot Segmentation | ADE20K training-free zero-shot segmentation | mIoU | 17.7 | COSMOS ViT-B/16 |
| Unsupervised Semantic Segmentation | COCO-Stuff-171 | mIoU | 23.2 | COSMOS ViT-B/16 |
| Unsupervised Semantic Segmentation | COCO-Object | mIoU | 31.3 | COSMOS ViT-B/16 |
| Unsupervised Semantic Segmentation | ADE20K | Mean IoU (val) | 17.7 | COSMOS ViT-B/16 |
| Unsupervised Semantic Segmentation | Cityscapes val | mIoU | 34.7 | COSMOS ViT-B/16 |
| Unsupervised Semantic Segmentation | PASCAL Context-59 | mIoU | 33.7 | COSMOS ViT-B/16 |
| Unsupervised Semantic Segmentation | PascalVOC-20 | mIoU | 77.7 | COSMOS ViT-B/16 |
| 10-shot image generation | COCO-Stuff-171 | mIoU | 23.2 | COSMOS ViT-B/16 |
| 10-shot image generation | COCO-Object | mIoU | 31.3 | COSMOS ViT-B/16 |
| 10-shot image generation | ADE20K | Mean IoU (val) | 17.7 | COSMOS ViT-B/16 |
| 10-shot image generation | Cityscapes val | mIoU | 34.7 | COSMOS ViT-B/16 |
| 10-shot image generation | PASCAL Context-59 | mIoU | 33.7 | COSMOS ViT-B/16 |
| 10-shot image generation | PascalVOC-20 | mIoU | 77.7 | COSMOS ViT-B/16 |
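
The retrieval rows above report Recall@K (R@1, R@5, R@10) and the segmentation rows report mean intersection-over-union (mIoU), both as percentages. The sketch below shows how these two metrics are commonly computed. It is illustrative only and does not reproduce the evaluation protocol behind the listed numbers. The function names are hypothetical. The retrieval helper assumes exactly one correct candidate per query, located at the same index, whereas Flickr30k and COCO pair each image with several captions. The mIoU helper assumes flat lists of integer class labels.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose correct candidate ranks in the top k.

    similarity[q][c] is the score between query q and candidate c.
    Simplifying assumption: the only correct candidate for query q is
    candidate q.
    """
    hits = 0
    for q, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


def mean_iou(pred, target, num_classes):
    """Mean IoU in percent over the classes that occur in pred or target.

    pred and target are equal-length flat lists of integer class labels
    (for example, flattened segmentation maps).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return 100.0 * sum(ious) / len(ious) if ious else 0.0


sim = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, 1), recall_at_k(sim, 2))
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In the toy similarity matrix, query 1 ranks candidate 0 first, so it counts as a miss at K=1 and as a hit at K=2.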

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
- Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
- A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
- SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
- Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)