Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, KyungSu Kim
We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a single tag (word) while neglecting other pertinent tags, and it stems from CLIP's text embeddings, which prioritize one specific tag in image-text relationships. When text is deconstructed into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to biased tag relevancy. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. TTD first extracts image-relevant tags from text based on their similarity to the nearest pixels, then employs a self-distillation strategy to align the combined masks with the text-derived mask. This approach ensures unbiased image-text alignment in CLIP-based models using only image-text pairs, without requiring additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code is available at https://github.com/shjo-april/TTD.
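
The two steps described in the abstract can be illustrated with a small NumPy sketch. This is not the released TTD implementation: the function names, the 0.5 threshold, the max-over-tags combination, and the L1 loss are all assumptions chosen for illustration, and the toy embeddings stand in for CLIP pixel, tag, and text embeddings.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def similarity_maps(pixel_emb, query_emb):
    """Cosine similarity of every pixel (P, D) to every query (Q, D) -> (P, Q)."""
    return l2_normalize(pixel_emb) @ l2_normalize(query_emb).T

def select_tags(pixel_emb, tag_emb, threshold=0.5):
    """Step 1 (assumed form): keep tags whose nearest pixel is similar enough."""
    sim = similarity_maps(pixel_emb, tag_emb)
    nearest = sim.max(axis=0)  # best-matching pixel per tag
    return np.flatnonzero(nearest >= threshold), sim

def combined_mask_loss(tag_sim, keep, text_sim):
    """Step 2 (assumed form): L1 gap between the union of tag maps and the text map."""
    combined = tag_sim[:, keep].max(axis=1)  # per-pixel max over kept tags
    return float(np.abs(combined - text_sim).mean())

# Toy example: 4 pixels in a 3-d embedding space (hypothetical values).
pixel_emb = np.array([[1.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 1.0, 0.0]])
tags = ["dog", "grass", "photo"]
tag_emb = np.array([[1.0, 0.0, 0.0],   # matches the first two pixels
                    [0.0, 1.0, 0.0],   # matches the last two pixels
                    [0.0, 0.0, 1.0]])  # matches no pixel
text_emb = np.array([[1.0, 1.0, 0.0]])  # stands in for the full caption embedding

keep, tag_sim = select_tags(pixel_emb, tag_emb)
text_sim = similarity_maps(pixel_emb, text_emb)[:, 0]
loss = combined_mask_loss(tag_sim, keep, text_sim)
kept_tags = [tags[i] for i in keep]
print(kept_tags, round(loss, 3))
```

In this toy setup the tag with no matching pixel is dropped, and the remaining gap between the combined tag map and the text map is the quantity a self-distillation objective of this kind would push toward zero during fine-tuning.
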
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | CC3M-TagMask | mIoU | 65.5 | TTD (TCL) |
| Semantic Segmentation | CC3M-TagMask | mIoU | 50.2 | TTD (MaskCLIP) |
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 23.7 | TTD (TCL) |
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 19.4 | TTD (MaskCLIP) |
| Semantic Segmentation | COCO-Object | mIoU | 37.4 | TTD (TCL) |
| Semantic Segmentation | COCO-Object | mIoU | 26.5 | TTD (MaskCLIP) |
| Semantic Segmentation | ADE20K | Mean IoU (val) | 17 | TTD (TCL) |
| Semantic Segmentation | ADE20K | Mean IoU (val) | 12.7 | TTD (MaskCLIP) |
| Semantic Segmentation | Cityscapes val | mIoU | 32 | TTD (MaskCLIP) |
| Semantic Segmentation | Cityscapes val | mIoU | 27 | TTD (TCL) |
| Semantic Segmentation | PASCAL Context-59 | mIoU | 37.4 | TTD (TCL) |
| Semantic Segmentation | PASCAL Context-59 | mIoU | 31 | TTD (MaskCLIP) |
| Semantic Segmentation | PASCAL VOC | mIoU | 61.1 | TTD (TCL) |
| Semantic Segmentation | PASCAL VOC | mIoU | 43.1 | TTD (MaskCLIP) |
| Multi-Label Text Classification | CC3M-TagMask | Accuracy | 88.6 | TTD (w/ fine-tuning) |
| Multi-Label Text Classification | CC3M-TagMask | F1 | 82.8 | TTD (w/ fine-tuning) |
| Multi-Label Text Classification | CC3M-TagMask | Precision | 88.3 | TTD (w/ fine-tuning) |
| Multi-Label Text Classification | CC3M-TagMask | Recall | 78 | TTD (w/ fine-tuning) |
| Multi-Label Text Classification | CC3M-TagMask | mAP | 93.7 | TTD (w/ fine-tuning) |
| Multi-Label Text Classification | CC3M-TagMask | Accuracy | 91 | TTD (w/o fine-tuning) |
| Multi-Label Text Classification | CC3M-TagMask | F1 | 78.5 | TTD (w/o fine-tuning) |
| Multi-Label Text Classification | CC3M-TagMask | Precision | 82.9 | TTD (w/o fine-tuning) |
| Multi-Label Text Classification | CC3M-TagMask | Recall | 74.5 | TTD (w/o fine-tuning) |
| Multi-Label Text Classification | CC3M-TagMask | mAP | 90.3 | TTD (w/o fine-tuning) |
| Unsupervised Semantic Segmentation | COCO-Stuff-171 | mIoU | 23.7 | TTD (TCL) |
| Unsupervised Semantic Segmentation | COCO-Stuff-171 | mIoU | 19.4 | TTD (MaskCLIP) |
| Unsupervised Semantic Segmentation | COCO-Object | mIoU | 37.4 | TTD (TCL) |
| Unsupervised Semantic Segmentation | COCO-Object | mIoU | 26.5 | TTD (MaskCLIP) |
| Unsupervised Semantic Segmentation | ADE20K | Mean IoU (val) | 17 | TTD (TCL) |
| Unsupervised Semantic Segmentation | ADE20K | Mean IoU (val) | 12.7 | TTD (MaskCLIP) |
| Unsupervised Semantic Segmentation | Cityscapes val | mIoU | 32 | TTD (MaskCLIP) |
| Unsupervised Semantic Segmentation | Cityscapes val | mIoU | 27 | TTD (TCL) |
| Unsupervised Semantic Segmentation | PASCAL Context-59 | mIoU | 37.4 | TTD (TCL) |
| Unsupervised Semantic Segmentation | PASCAL Context-59 | mIoU | 31 | TTD (MaskCLIP) |
| Unsupervised Semantic Segmentation | PASCAL VOC | mIoU | 61.1 | TTD (TCL) |
| Unsupervised Semantic Segmentation | PASCAL VOC | mIoU | 43.1 | TTD (MaskCLIP) |
| Open Vocabulary Semantic Segmentation | COCO-Stuff-171 | mIoU | 23.7 | TTD (TCL) |
| Open Vocabulary Semantic Segmentation | COCO-Stuff-171 | mIoU | 19.4 | TTD (MaskCLIP) |
| Open Vocabulary Semantic Segmentation | Cityscapes | mIoU | 32 | TTD (MaskCLIP) |
| Open Vocabulary Semantic Segmentation | Cityscapes | mIoU | 27 | TTD (TCL) |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | mIoU | 37.4 | TTD (TCL) |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | mIoU | 31 | TTD (MaskCLIP) |
| Open Vocabulary Semantic Segmentation | ADE20K-150 | mIoU | 17 | TTD (TCL) |
| Open Vocabulary Semantic Segmentation | ADE20K-150 | mIoU | 12.7 | TTD (MaskCLIP) |
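
As a quick consistency check on the CC3M-TagMask classification rows, the reported F1 values can be compared against the harmonic mean of the reported precision and recall. The sketch below assumes F1 was computed from the same aggregate precision and recall shown in the table; the source does not state the averaging scheme, so this is only a sanity check, and the helper name is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the table (percent).
f1_with_ft = f1_from_precision_recall(88.3, 78.0)     # TTD (w/ fine-tuning)
f1_without_ft = f1_from_precision_recall(82.9, 74.5)  # TTD (w/o fine-tuning)

print(round(f1_with_ft, 1), round(f1_without_ft, 1))  # prints: 82.8 78.5
```

Both values agree with the F1 column to one decimal place.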