Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy

2023-10-02
Tasks: Panoptic Segmentation, Image Classification, Zero-Shot Image Classification, Open Vocabulary Semantic Segmentation, Segmentation, Semantic Segmentation, Open Vocabulary Panoptic Segmentation, Prediction, Open Vocabulary Object Detection, Object Detection, Image Segmentation
Paper · PDF · Code (official)

Abstract

Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers a ViT to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.
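To make the self-distillation idea in the abstract concrete, here is a minimal sketch of one training step. It assumes a cosine alignment loss, RoIAlign pooling over the student's dense feature map, and a hypothetical `encode_dense` helper on the ViT; the official recipe in the repository above may differ in the crop-sampling strategy, the pooling operator, and the exact loss.

```python
# Minimal sketch of CLIPSelf-style self-distillation (assumptions: cosine
# alignment loss, RoIAlign pooling, and a hypothetical `encode_dense` helper;
# see https://github.com/wusize/CLIPSelf for the actual implementation).
import torch
import torch.nn.functional as F
import torchvision.ops as ops


def clipself_distill_step(student, teacher, images, boxes, box_crops):
    """One self-distillation step.

    images:    (B, 3, H, W) full images fed to the student ViT.
    boxes:     list of B tensors, each (N_i, 4), region boxes in image coords.
    box_crops: (sum N_i, 3, h, w) the same regions cropped, resized, and fed
               to the frozen teacher as ordinary images.
    """
    # Teacher: frozen CLIP ViT producing image-level embeddings of the crops.
    with torch.no_grad():
        target = teacher.encode_image(box_crops)            # (sum N_i, D)
        target = F.normalize(target, dim=-1)

    # Student: the same ViT, kept trainable, returning a dense per-patch
    # feature map projected to the CLIP embedding dimension.
    # `encode_dense` is a hypothetical helper returning (B, D, H/16, W/16).
    dense = student.encode_dense(images)

    # Pool the dense features inside each box into a region embedding.
    region = ops.roi_align(dense, boxes, output_size=1,
                           spatial_scale=1.0 / 16).flatten(1)  # (sum N_i, D)
    region = F.normalize(region, dim=-1)

    # Align region embeddings with the teacher's crop embeddings.
    loss = 1.0 - (region * target).sum(dim=-1).mean()
    return loss
```

The point the sketch illustrates is that the distillation targets come from the frozen model's own image-level embeddings of region crops, so no region-text pairs are required; the downstream open-vocabulary detectors and segmenters then reuse the student's region features.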

Results

Task | Dataset | Metric | Value | Model
Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 34.9 | CLIPSelf
Object Detection | MSCOCO | AP 0.5 | 44.3 | CLIPSelf
Open Vocabulary Panoptic Segmentation | ADE20K | PQ | 23.7 | CLIPSelf
Open Vocabulary Object Detection | LVIS v1.0 | AP novel (LVIS base training) | 34.9 | CLIPSelf
Open Vocabulary Object Detection | MSCOCO | AP 0.5 | 44.3 | CLIPSelf
Open Vocabulary Semantic Segmentation | ADE20K-847 | mIoU | 12.4 | CLIPSelf
Open Vocabulary Semantic Segmentation | PASCAL Context-59 | mIoU | 62.3 | CLIPSelf
Open Vocabulary Semantic Segmentation | ADE20K-150 | mIoU | 34.5 | CLIPSelf

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)