Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, KyungSu Kim

2024-03-30 · Unsupervised Semantic Segmentation with Language-image Pre-training · TAG · Open Vocabulary Semantic Segmentation · Semantic Segmentation · Multi-Label Text Classification

Paper · PDF · Code (official)

Abstract

We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to biased tag relevancy. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. TTD first extracts image-relevant tags from text based on their similarity to the nearest pixels, then employs a self-distillation strategy to align combined masks with the text-derived mask. This approach ensures the unbiased image-text alignment of the CLIP-based models using only image-text pairs, without necessitating additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code is available at https://github.com/shjo-april/TTD.
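The first TTD step described in the abstract — scoring each caption tag by its similarity to the *nearest* pixel embedding and keeping only image-relevant tags — can be sketched as below. This is an illustrative reconstruction, not the paper's implementation: the function name, shapes, threshold value, and the random vectors standing in for CLIP features are all assumptions.

```python
import numpy as np

def extract_image_relevant_tags(pixel_embs, tag_embs, tags, threshold=0.5):
    """Score each tag by its cosine similarity to its best-matching pixel,
    then keep tags above a threshold. pixel_embs: (P, D), tag_embs: (T, D),
    both assumed L2-normalized. Threshold is illustrative."""
    # Cosine similarity between every tag and every pixel: shape (T, P)
    sim = tag_embs @ pixel_embs.T
    # A tag's relevancy is its similarity to the nearest (best-matching) pixel
    tag_scores = sim.max(axis=1)
    keep = tag_scores >= threshold
    return [t for t, k in zip(tags, keep) if k], tag_scores

# Toy example: random unit vectors stand in for CLIP pixel/tag features
rng = np.random.default_rng(0)
def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

pixels = unit(rng.normal(size=(16, 8)))
tags = ["dog", "grass", "frisbee"]
tag_embs = unit(rng.normal(size=(3, 8)))
tag_embs[0] = pixels[3]  # make "dog" match one pixel exactly, by construction

relevant, scores = extract_image_relevant_tags(pixels, tag_embs, tags, threshold=0.9)
print(relevant)  # "dog" is kept: its nearest-pixel similarity is exactly 1.0
```

Because only tags grounded in some image region pass the threshold, a multi-tag caption no longer collapses onto the single dominant tag, which is the bias the paper targets.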

Results

Semantic Segmentation (mIoU; open-vocabulary / unsupervised evaluation)

| Dataset | TTD (TCL) | TTD (MaskCLIP) |
|---|---|---|
| CC3M-TagMask | 65.5 | 50.2 |
| PASCAL VOC | 61.1 | 43.1 |
| PASCAL Context-59 | 37.4 | 31.0 |
| COCO-Object | 37.4 | 26.5 |
| Cityscapes (val) | 27.0 | 32.0 |
| COCO-Stuff-171 | 23.7 | 19.4 |
| ADE20K (val) | 17.0 | 12.7 |

Multi-Label Text Classification on CC3M-TagMask

| Metric | TTD (w/ fine-tuning) | TTD (w/o fine-tuning) |
|---|---|---|
| mAP | 93.7 | 90.3 |
| Accuracy | 88.6 | 91.0 |
| F1 | 82.8 | 78.5 |
| Precision | 88.3 | 82.9 |
| Recall | 78.0 | 74.5 |

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation (2025-07-15)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)