Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CRIS: CLIP-Driven Referring Image Segmentation

Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, Tongliang Liu

2021-11-30 · CVPR 2022
Tasks: Generalized Referring Expression Segmentation · Referring Expression Segmentation · Segmentation · Semantic Segmentation · Contrastive Learning · Image Segmentation
Paper · PDF · Code (official)

Abstract

Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties of text and images, it is challenging for a network to align text and pixel-level features well. Existing approaches use pretrained models to facilitate learning, yet they transfer the language and vision knowledge from pretrained models separately, ignoring the multi-modal correspondence information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment. More specifically, we design a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities. In addition, we present text-to-pixel contrastive learning to explicitly enforce the text feature to be similar to the related pixel-level features and dissimilar to irrelevant ones. Experimental results on three benchmark datasets demonstrate that our proposed framework significantly outperforms the previous state of the art without any post-processing. The code will be released.
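The text-to-pixel contrastive objective described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single sentence-level text embedding, a flattened grid of pixel embeddings, and a binary ground-truth mask, and uses a sigmoid cross-entropy over scaled cosine similarities to pull referred pixels toward the text feature and push the rest away. The function name and `temperature` parameter are illustrative choices, not names from the paper.

```python
import numpy as np

def text_to_pixel_contrastive_loss(text_feat, pixel_feats, mask, temperature=0.07):
    """Hypothetical sketch of a text-to-pixel contrastive loss.

    text_feat:   (D,)  embedding of the referring expression
    pixel_feats: (N, D) per-pixel embeddings from the visual decoder
    mask:        (N,)  binary ground truth (1 = referred pixel)
    """
    # Normalize so the dot product is a cosine similarity.
    t = text_feat / np.linalg.norm(text_feat)
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)

    logits = p @ t / temperature               # (N,) similarity per pixel
    probs = 1.0 / (1.0 + np.exp(-logits))      # sigmoid

    # Binary cross-entropy: referred pixels should score high,
    # irrelevant pixels should score low.
    eps = 1e-8
    loss = -(mask * np.log(probs + eps) + (1.0 - mask) * np.log(1.0 - probs + eps))
    return loss.mean()
```

Under this sketch, a text feature that already aligns with the referred pixels and is near-orthogonal to the background yields a small loss, while the reversed assignment yields a large one.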

Results

Task                              Dataset          Metric       Value   Model
Instance Segmentation             RefCOCO val      Overall IoU  70.47   CRIS
Instance Segmentation             RefCOCO+ val     Overall IoU  62.27   CRIS
Instance Segmentation             RefCOCO+ testA   Overall IoU  68.08   CRIS
Instance Segmentation             RefCOCO+ testB   Overall IoU  53.68   CRIS
Instance Segmentation             gRefCOCO         cIoU         55.34   CRIS
Instance Segmentation             gRefCOCO         gIoU         56.27   CRIS
Referring Expression Segmentation RefCOCO val      Overall IoU  70.47   CRIS
Referring Expression Segmentation RefCOCO+ val     Overall IoU  62.27   CRIS
Referring Expression Segmentation RefCOCO+ testA   Overall IoU  68.08   CRIS
Referring Expression Segmentation RefCOCO+ testB   Overall IoU  53.68   CRIS
Referring Expression Segmentation gRefCOCO         cIoU         55.34   CRIS
Referring Expression Segmentation gRefCOCO         gIoU         56.27   CRIS

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)