Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy

2023-10-02
Tasks: Panoptic Segmentation, Image Classification, Zero-Shot Image Classification, Open Vocabulary Semantic Segmentation, Segmentation, Semantic Segmentation, Open Vocabulary Panoptic Segmentation, Prediction, Open Vocabulary Object Detection, Object Detection, Image Segmentation
Paper · PDF · Code (official)

Abstract

Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers a ViT to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.
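To make the self-distillation idea in the abstract concrete, here is a minimal sketch of one training step. It assumes a cosine alignment loss, RoIAlign pooling over the student's dense feature map, and a hypothetical `encode_dense` helper on the ViT; the official recipe in the repository above may differ in the crop-sampling strategy, the pooling operator, and the exact loss.

```python
# Minimal sketch of CLIPSelf-style self-distillation (assumptions: cosine
# alignment loss, RoIAlign pooling, and a hypothetical `encode_dense` helper;
# see https://github.com/wusize/CLIPSelf for the actual implementation).
import torch
import torch.nn.functional as F
import torchvision.ops as ops


def clipself_distill_step(student, teacher, images, boxes, box_crops):
    """One self-distillation step.

    images:    (B, 3, H, W) full images fed to the student ViT.
    boxes:     list of B tensors, each (N_i, 4), region boxes in image coords.
    box_crops: (sum N_i, 3, h, w) the same regions cropped, resized, and fed
               to the frozen teacher as ordinary images.
    """
    # Teacher: frozen CLIP ViT producing image-level embeddings of the crops.
    with torch.no_grad():
        target = teacher.encode_image(box_crops)            # (sum N_i, D)
        target = F.normalize(target, dim=-1)

    # Student: the same ViT, kept trainable, returning a dense per-patch
    # feature map projected to the CLIP embedding dimension.
    # `encode_dense` is a hypothetical helper returning (B, D, H/16, W/16).
    dense = student.encode_dense(images)

    # Pool the dense features inside each box into a region embedding.
    region = ops.roi_align(dense, boxes, output_size=1,
                           spatial_scale=1.0 / 16).flatten(1)  # (sum N_i, D)
    region = F.normalize(region, dim=-1)

    # Align region embeddings with the teacher's crop embeddings.
    loss = 1.0 - (region * target).sum(dim=-1).mean()
    return loss
```

The point the sketch illustrates is that the distillation targets come from the frozen model's own image-level embeddings of region crops, so no region-text pairs are required; the downstream open-vocabulary detectors and segmenters then reuse the student's region features.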

Results

Task | Dataset | Metric | Value | Model
Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 34.9 | CLIPSelf
Object Detection | MSCOCO | AP 0.5 | 44.3 | CLIPSelf
Open Vocabulary Panoptic Segmentation | ADE20K | PQ | 23.7 | CLIPSelf
Open Vocabulary Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 34.9 | CLIPSelf
Open Vocabulary Object Detection | MSCOCO | AP 0.5 | 44.3 | CLIPSelf
Open Vocabulary Semantic Segmentation | ADE20K-847 | mIoU | 12.4 | CLIPSelf
Open Vocabulary Semantic Segmentation | PASCAL Context-59 | mIoU | 62.3 | CLIPSelf
Open Vocabulary Semantic Segmentation | ADE20K-150 | mIoU | 34.5 | CLIPSelf

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)