TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

Qinying Liu, Wei Wu, Kecheng Zheng, Zhan Tong, Jiawei Liu, Yu Liu, Wei Chen, Zilei Wang, Yujun Shen

2023-12-21

Unsupervised Semantic Segmentation with Language-image Pre-training, TAG, Attribute, Open Vocabulary Semantic Segmentation, Semantic Segmentation

Abstract

The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually face the problem of coarse alignment, e.g., the vision encoder struggles to localize an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features with no need for additional data formats other than image-text pairs. Concretely, given an image and its paired text, we parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. It is noteworthy that the parsing pipeline is fully automatic and thus enjoys good scalability. With these parsed semantics as supervision signals, we can complement the commonly used image-text contrastive loss with a multi-tag classification loss. Extensive experimental results on a broad suite of semantic segmentation datasets substantiate the average 5.2% improvement of our framework over existing alternatives. Furthermore, the visualization results indicate that attribute supervision makes vision-language models accurately localize attribute-specified objects. The project page can be found at https://qinying-liu.github.io/Tag-Align.
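
The objective described in the abstract (an image-text contrastive loss complemented by a multi-tag classification loss) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not the authors' implementation: the toy vocabulary `TAG_VOCAB`, the word-matching `parse_tags` stand-in for the automatic parsing pipeline, the binary cross-entropy form of the tag loss, the symmetric InfoNCE form of the contrastive loss, and the `weight` parameter that balances the two terms.

```python
import math

# Hypothetical tag vocabulary (objects and attributes); the abstract does not list one.
TAG_VOCAB = ["cat", "dog", "sofa", "black", "white"]


def parse_tags(caption, vocab=TAG_VOCAB):
    """Toy stand-in for tag parsing: multi-hot vector of vocabulary words in the caption."""
    words = set(caption.lower().replace(",", " ").replace(".", " ").split())
    return [1.0 if tag in words else 0.0 for tag in vocab]


def multi_tag_loss(logits, targets):
    """Mean binary cross-entropy over tags (an assumed form of the tag loss)."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable BCE with logits: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def _cross_entropy(row, target_index):
    """Softmax cross-entropy of one row of scores against one target index."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return log_sum - row[target_index]


def contrastive_loss(sim):
    """Symmetric InfoNCE over a square image-text similarity matrix.

    sim[i][j] is the (already temperature-scaled) similarity of image i and text j;
    matching pairs sit on the diagonal.
    """
    n = len(sim)
    image_to_text = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    columns = [[sim[r][c] for r in range(n)] for c in range(n)]
    text_to_image = sum(_cross_entropy(columns[i], i) for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


def total_loss(sim, tag_logits, tag_targets, weight=1.0):
    """Contrastive loss plus a weighted, batch-averaged multi-tag loss."""
    per_image = [multi_tag_loss(l, t) for l, t in zip(tag_logits, tag_targets)]
    return contrastive_loss(sim) + weight * sum(per_image) / len(per_image)
```

For example, `parse_tags("A black cat on the sofa.")` marks `cat`, `sofa`, and `black` as positive targets, so the image encoder is supervised on both the object and its attribute rather than on the caption as a whole.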

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	COCO-Stuff-171	mIoU	25.3	TagAlign
Semantic Segmentation	COCO-Object	mIoU	33.3	TagAlign
Semantic Segmentation	ADE20K	Mean IoU (val)	17.3	TagAlign
Semantic Segmentation	Cityscapes val	mIoU	27.5	TagAlign
Semantic Segmentation	PASCAL Context-59	mIoU	37.6	TagAlign
Semantic Segmentation	PascalVOC-20	mIoU	87.9	TagAlign
Semantic Segmentation	PASCAL VOC	mIoU	53.9	TagAlign
Unsupervised Semantic Segmentation	COCO-Stuff-171	mIoU	25.3	TagAlign
Unsupervised Semantic Segmentation	COCO-Object	mIoU	33.3	TagAlign
Unsupervised Semantic Segmentation	ADE20K	Mean IoU (val)	17.3	TagAlign
Unsupervised Semantic Segmentation	Cityscapes val	mIoU	27.5	TagAlign
Unsupervised Semantic Segmentation	PASCAL Context-59	mIoU	37.6	TagAlign
Unsupervised Semantic Segmentation	PascalVOC-20	mIoU	87.9	TagAlign
Unsupervised Semantic Segmentation	PASCAL VOC	mIoU	53.9	TagAlign
Open Vocabulary Semantic Segmentation	PascalVOC-20	mIoU	87.9	TagAlign(trained with image-text pairs)
Open Vocabulary Semantic Segmentation	PASCAL Context-59	mIoU	37.6	TagAlign(trained with image-text pairs)
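
The table reports mIoU (mean intersection over union), shown as a percentage. As a rough illustration of the metric, the sketch below computes it for a single pair of flattened label maps. The function name `mean_iou` and its conventions are assumptions: benchmark protocols usually accumulate counts over a whole dataset and may ignore certain labels, which this sketch does not do.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred and target are equal-length sequences of integer class labels
    (e.g. flattened segmentation maps).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel counts toward both the intersection and the union of its class.
            inter[p] += 1
            union[p] += 1
        else:
            # A mismatch enlarges the union of both the predicted and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

With `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.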
