TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem

2025-05-29

Unsupervised Semantic Segmentation with Language-image Pre-training, Referring Expression, Referring Expression Comprehension, Semantic Segmentation

Abstract

Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.
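
The abstract does not spell out how the region tokens are computed, so the following is only a rough sketch of one plausible reading: average the patch features of a frozen image-text model inside each segmentation mask, then compare the pooled vectors with text embeddings by cosine similarity. The function names (`region_tokens`, `label_regions`), the array shapes, and the plain mask-average pooling are assumptions made for illustration, not the authors' implementation; in practice the masks would come from SAM2 and the features from the image-text model, both resized to a shared grid.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length (eps guards against zero vectors)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def region_tokens(patch_feats, masks):
    """Hypothetical helper: mask-average pooling of patch features.

    patch_feats: (H, W, D) float array of patch embeddings.
    masks:       (R, H, W) boolean array, one mask per region,
                 assumed to be already resized to the patch grid.
    Returns an (R, D) array of unit-length region tokens.
    """
    h, w, d = patch_feats.shape
    flat_feats = patch_feats.reshape(h * w, d)
    flat_masks = masks.reshape(masks.shape[0], h * w).astype(np.float64)
    area = flat_masks.sum(axis=1, keepdims=True)
    # Sum the features under each mask, then divide by the mask area.
    pooled = flat_masks @ flat_feats / np.maximum(area, 1.0)
    return l2_normalize(pooled)


def label_regions(tokens, text_embeds):
    """Assign each region the index of its most similar text embedding."""
    sims = tokens @ l2_normalize(text_embeds).T  # cosine similarity, (R, C)
    return sims.argmax(axis=1)


# Toy stand-ins for real model outputs: a 2x2 patch grid with 2-D features.
toy_feats = np.array([[[1.0, 0.0], [0.0, 1.0]],
                      [[1.0, 0.0], [0.0, 1.0]]])
toy_masks = np.array([[[True, False], [True, False]],    # left column
                      [[False, True], [False, True]]])   # right column
toy_text = np.array([[0.0, 1.0], [1.0, 0.0]])            # two class prompts

toy_tokens = region_tokens(toy_feats, toy_masks)
toy_labels = label_regions(toy_tokens, toy_text)
```

Because the pooled tokens live in the same space as the text embeddings, the class list can be changed at inference time without retraining, which is the open-vocabulary property the abstract refers to.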

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	COCO-Stuff-171	mIoU	31.2	TextRegion
Semantic Segmentation	ADE20K	Mean IoU (val)	27.3	TextRegion
Semantic Segmentation	PASCAL Context-59	mIoU	46.1	TextRegion
Semantic Segmentation	PASCAL Context-60	mIoU	41.2	TextRegion
Semantic Segmentation	PascalVOC-20	mIoU	89.5	TextRegion
Semantic Segmentation	PASCAL VOC	mIoU	73.1	TextRegion
Unsupervised Semantic Segmentation	COCO-Stuff-171	mIoU	31.2	TextRegion
Unsupervised Semantic Segmentation	ADE20K	Mean IoU (val)	27.3	TextRegion
Unsupervised Semantic Segmentation	PASCAL Context-59	mIoU	46.1	TextRegion
Unsupervised Semantic Segmentation	PASCAL Context-60	mIoU	41.2	TextRegion
Unsupervised Semantic Segmentation	PascalVOC-20	mIoU	89.5	TextRegion
Unsupervised Semantic Segmentation	PASCAL VOC	mIoU	73.1	TextRegion
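
For readers unfamiliar with the metric in the table, here is a minimal sketch of mean intersection-over-union (mIoU) on a single pair of label maps. The function name `mean_iou` and the `ignore_index=255` default are assumptions for illustration (255 is a common "unlabeled" convention); benchmark toolkits typically accumulate intersections and unions over the whole dataset before dividing, and the values in the table appear to be reported as percentages.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Per-image mIoU sketch: average IoU over classes that occur in pred or target.

    pred, target: integer arrays of the same shape holding class indices.
    Pixels whose target equals `ignore_index` are excluded.
    """
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")


toy_target = np.array([[0, 0], [1, 1]])
toy_pred = np.array([[0, 1], [1, 1]])
# Class 0: intersection 1, union 2. Class 1: intersection 2, union 3.
toy_score = mean_iou(toy_pred, toy_target, num_classes=2)  # (1/2 + 2/3) / 2
```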
