Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata
Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks. Code is available at https://github.com/ExplainableML/cosmos.
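The multi-modal augmentation described above (global and local views of both images and texts) can be sketched for the text side as follows. This is an illustrative sketch only: the function name, span-length bounds, and number of local views are assumptions for exposition, not the exact recipe from the paper. The full caption serves as the global view, and random contiguous sub-spans serve as local views, mirroring multi-crop image augmentation.

```python
import random

def text_crop(tokens, min_frac=0.3, max_frac=0.7, num_local=2, seed=None):
    """Sketch of a text-cropping strategy (hypothetical parameters).

    Returns the full caption as the global view and `num_local` random
    contiguous sub-spans as local views.
    """
    rng = random.Random(seed)
    global_view = tokens[:]  # global view: the full caption
    local_views = []
    for _ in range(num_local):
        # pick a span length as a random fraction of the caption length
        frac = rng.uniform(min_frac, max_frac)
        span = max(1, int(len(tokens) * frac))
        # pick a random start so the span stays inside the caption
        start = rng.randint(0, len(tokens) - span)
        local_views.append(tokens[start:start + span])
    return global_view, local_views

caption = "a brown dog catching a frisbee in a grassy park".split()
global_view, local_views = text_crop(caption, seed=0)
```

In a self-distillation setup of the kind the abstract describes, the global views would typically be fed to a teacher branch and the local views to a student branch, with a cross-modality loss encouraging local views of one modality to match global views of the other.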
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 23.2 | COSMOS ViT-B/16 |
| Semantic Segmentation | COCO-Object | mIoU | 31.3 | COSMOS ViT-B/16 |
| Semantic Segmentation | ADE20K val | mIoU | 17.7 | COSMOS ViT-B/16 |
| Semantic Segmentation | Cityscapes val | mIoU | 34.7 | COSMOS ViT-B/16 |
| Semantic Segmentation | PASCAL Context-59 | mIoU | 33.7 | COSMOS ViT-B/16 |
| Semantic Segmentation | PascalVOC-20 | mIoU | 77.7 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Image-to-text R@1 | 92.9 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Image-to-text R@5 | 99.4 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Image-to-text R@10 | 99.9 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Text-to-image R@1 | 80.3 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Text-to-image R@5 | 95.3 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Text-to-image R@10 | 97.6 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Image-to-text R@1 | 89.9 | COSMOS ViT-B/32 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Image-to-text R@5 | 98.8 | COSMOS ViT-B/32 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Image-to-text R@10 | 99.3 | COSMOS ViT-B/32 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Text-to-image R@1 | 76.1 | COSMOS ViT-B/32 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Text-to-image R@5 | 92.8 | COSMOS ViT-B/32 |
| Zero-Shot Image-Text Retrieval | Flickr30k | Text-to-image R@10 | 96.2 | COSMOS ViT-B/32 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Image-to-text R@1 | 68.0 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Image-to-text R@5 | 87.8 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Image-to-text R@10 | 92.5 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Text-to-image R@1 | 52.5 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Text-to-image R@5 | 77.2 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Text-to-image R@10 | 84.9 | COSMOS ViT-B/16 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Image-to-text R@1 | 64.3 | COSMOS ViT-B/32 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Image-to-text R@5 | 86.5 | COSMOS ViT-B/32 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Image-to-text R@10 | 92.0 | COSMOS ViT-B/32 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Text-to-image R@1 | 48.4 | COSMOS ViT-B/32 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Text-to-image R@5 | 74.2 | COSMOS ViT-B/32 |
| Zero-Shot Image-Text Retrieval | COCO 2014 | Text-to-image R@10 | 82.6 | COSMOS ViT-B/32 |