Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

Changming Xiao, Qi Yang, Feng Zhou, Changshui Zhang

2023-09-08

Tasks: Denoising, Weakly-Supervised Semantic Segmentation, Text-to-Image Generation, Segmentation, Semantic Segmentation, Image Generation, Image Segmentation

Paper · PDF · Code (official)

Abstract

Diffusion models have recently revolutionized the field of text-to-image generation. Their unique way of fusing text and image information contributes to their remarkable capability of generating highly text-related images. From another perspective, these generative models imply clues about the precise correlation between words and pixels. In this work, a simple but effective method is proposed to utilize the attention mechanism in the denoising network of text-to-image diffusion models. Without re-training or inference-time optimization, the semantic grounding of phrases can be attained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly-supervised semantic segmentation setting, and our method achieves superior performance to prior methods. In addition, the acquired word-pixel correlation is found to be generalizable to the learned text embeddings of customized generation methods, requiring only a few modifications. To validate our discovery, we introduce a new practical task called "personalized referring image segmentation" with a new dataset. Experiments in various situations demonstrate the advantages of our method compared to strong baselines on this task. In summary, our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.
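The core mechanism the abstract describes, reading off word-pixel correspondences from cross-attention in the denoising network, can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the function names (`cross_attention_map`, `word_mask`) and the simple min-max normalization plus thresholding are illustrative assumptions, standing in for the actual UNet attention layers of a text-to-image diffusion model.

```python
import numpy as np

def cross_attention_map(pixel_queries, token_keys):
    """Scaled dot-product cross-attention from image pixels to text tokens.

    pixel_queries: (H*W, d) array of per-pixel query features.
    token_keys:    (T, d) array of per-token key features.
    Returns a (H*W, T) matrix of attention weights (rows sum to 1).
    """
    d = pixel_queries.shape[-1]
    scores = pixel_queries @ token_keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def word_mask(attn, token_idx, hw, threshold=0.5):
    """Turn one token's attention column into a binary segmentation mask.

    Min-max normalizes the per-pixel attention for the chosen token,
    reshapes it to the spatial grid, and thresholds it.
    """
    a = attn[:, token_idx].reshape(hw)
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)
    return a >= threshold

# Toy example: a 2x2 image, 2 text tokens. The top row of pixels is
# aligned with token 0, the bottom row with token 1.
queries = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
keys = np.array([[10., 0.], [0., 10.]])
attn = cross_attention_map(queries, keys)
mask = word_mask(attn, token_idx=0, hw=(2, 2))  # top row attends to token 0
```

In a real diffusion model, the attention maps would be gathered from the UNet's cross-attention layers over several denoising steps and resolutions before thresholding; the sketch only shows the word-to-pixel readout itself.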

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | COCO 2014 val | mIoU | 45.7 | T2MDiffusion (DeepLabV2-ResNet101)
Semantic Segmentation | PASCAL VOC 2012 val | Mean IoU | 73.3 | T2MDiffusion (DeepLabV2-ResNet101)
Semantic Segmentation | PASCAL VOC 2012 test | Mean IoU | 74.2 | T2MDiffusion (DeepLabV2-ResNet101)

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models (2025-07-17)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)