Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, Tongliang Liu
Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties of text and image, it is challenging for a network to align text and pixel-level features well. Existing approaches use pretrained models to facilitate learning, yet they transfer language and vision knowledge from these models separately, ignoring the multi-modal correspondence. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment. More specifically, we design a vision-language decoder that propagates fine-grained semantic information from textual representations to each pixel-level activation, promoting consistency between the two modalities. In addition, we present text-to-pixel contrastive learning, which explicitly enforces the text feature to be similar to the related pixel-level features and dissimilar to irrelevant ones. Experimental results on three benchmark datasets demonstrate that our proposed framework significantly outperforms the state of the art without any post-processing. The code will be released.
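The paper does not spell out its loss here, but the idea of text-to-pixel contrastive learning — pulling the sentence embedding toward features of referent pixels and pushing it away from the rest — can be sketched as a sigmoid-based objective. A minimal NumPy illustration, where all shapes and the binary cross-entropy formulation are assumptions for clarity rather than the authors' exact implementation:

```python
import numpy as np

def text_to_pixel_contrastive_loss(text_feat, pixel_feats, mask, eps=1e-8):
    """Illustrative contrastive objective between one text embedding
    per image and every pixel-level feature (hypothetical shapes).

    text_feat:   (B, C)       sentence embedding per image
    pixel_feats: (B, C, H, W) pixel-level visual features
    mask:        (B, H, W)    1 on referent pixels, 0 elsewhere
    """
    # L2-normalize so the dot product is a cosine similarity.
    t = text_feat / (np.linalg.norm(text_feat, axis=-1, keepdims=True) + eps)
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=1, keepdims=True) + eps)
    # Similarity of the text feature to each pixel feature: (B, H, W).
    sim = np.einsum("bc,bchw->bhw", t, p)
    # Binary cross-entropy pulls similarity up on referent pixels
    # and pushes it down on irrelevant ones.
    prob = 1.0 / (1.0 + np.exp(-sim))
    loss = -(mask * np.log(prob + eps)
             + (1 - mask) * np.log(1 - prob + eps)).mean()
    return loss
```

Treating the per-pixel similarity map as segmentation logits in this way also yields the final mask at inference time by thresholding, which is one reason no post-processing is needed.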
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Referring Expression Segmentation | RefCOCO val | Overall IoU | 70.47 | CRIS |
| Referring Expression Segmentation | RefCOCO+ val | Overall IoU | 62.27 | CRIS |
| Referring Expression Segmentation | RefCOCO+ testA | Overall IoU | 68.08 | CRIS |
| Referring Expression Segmentation | RefCOCO+ testB | Overall IoU | 53.68 | CRIS |
| Referring Expression Segmentation | gRefCOCO | cIoU | 55.34 | CRIS |
| Referring Expression Segmentation | gRefCOCO | gIoU | 56.27 | CRIS |