Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

Changming Xiao, Qi Yang, Feng Zhou, Changshui Zhang

2023-09-08

Tasks: Denoising, Weakly-Supervised Semantic Segmentation, Text-to-Image Generation, Segmentation, Semantic Segmentation, Image Generation, Image Segmentation

Paper · PDF · Code (official)

Abstract

Diffusion models have recently revolutionized the field of text-to-image generation. Their unique way of fusing text and image information contributes to their remarkable capability of generating highly text-related images. From another perspective, these generative models imply clues about the precise correlation between words and pixels. In this work, a simple but effective method is proposed to utilize the attention mechanism in the denoising network of text-to-image diffusion models. Without re-training or inference-time optimization, the semantic grounding of phrases can be attained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly-supervised semantic segmentation setting, and our method achieves superior performance to prior methods. In addition, the acquired word-pixel correlation is found to be generalizable to the learned text embeddings of customized generation methods, requiring only a few modifications. To validate our discovery, we introduce a new practical task called "personalized referring image segmentation" with a new dataset. Experiments in various situations demonstrate the advantages of our method compared to strong baselines on this task. In summary, our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.
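The core mechanism the abstract describes, reading off word-pixel correspondences from cross-attention in the denoising network, can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the function names (`cross_attention_map`, `word_mask`) and the simple min-max normalization plus thresholding are illustrative assumptions, standing in for the actual UNet attention layers of a text-to-image diffusion model.

```python
import numpy as np

def cross_attention_map(pixel_queries, token_keys):
    """Scaled dot-product cross-attention from image pixels to text tokens.

    pixel_queries: (H*W, d) array of per-pixel query features.
    token_keys:    (T, d) array of per-token key features.
    Returns a (H*W, T) matrix of attention weights (rows sum to 1).
    """
    d = pixel_queries.shape[-1]
    scores = pixel_queries @ token_keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def word_mask(attn, token_idx, hw, threshold=0.5):
    """Turn one token's attention column into a binary segmentation mask.

    Min-max normalizes the per-pixel attention for the chosen token,
    reshapes it to the spatial grid, and thresholds it.
    """
    a = attn[:, token_idx].reshape(hw)
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)
    return a >= threshold

# Toy example: a 2x2 image, 2 text tokens. The top row of pixels is
# aligned with token 0, the bottom row with token 1.
queries = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
keys = np.array([[10., 0.], [0., 10.]])
attn = cross_attention_map(queries, keys)
mask = word_mask(attn, token_idx=0, hw=(2, 2))  # top row attends to token 0
```

In a real diffusion model, the attention maps would be gathered from the UNet's cross-attention layers over several denoising steps and resolutions before thresholding; the sketch only shows the word-to-pixel readout itself.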

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | COCO 2014 val | mIoU | 45.7 | T2MDiffusion (DeepLabV2-ResNet101)
Semantic Segmentation | PASCAL VOC 2012 val | Mean IoU | 73.3 | T2MDiffusion (DeepLabV2-ResNet101)
Semantic Segmentation | PASCAL VOC 2012 test | Mean IoU | 74.2 | T2MDiffusion (DeepLabV2-ResNet101)

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models (2025-07-17)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)