Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

Qinying Liu, Wei Wu, Kecheng Zheng, Zhan Tong, Jiawei Liu, Yu Liu, Wei Chen, Zilei Wang, Yujun Shen

Date: 2023-12-21
Tasks: Unsupervised Semantic Segmentation with Language-image Pre-training, TAG, Attribute, Open Vocabulary Semantic Segmentation, Semantic Segmentation

Abstract

The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually face the problem of coarse alignment, e.g., the vision encoder struggles in localizing an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features with no need of additional data formats other than image-text pairs. Concretely, given an image and its paired text, we manage to parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. It is noteworthy that the parsing pipeline is fully automatic and thus enjoys good scalability. With these parsed semantics as supervision signals, we can complement the commonly used image-text contrastive loss with the multi-tag classification loss. Extensive experimental results on a broad suite of semantic segmentation datasets substantiate the average 5.2% improvement of our framework over existing alternatives. Furthermore, the visualization results indicate that attribute supervision makes vision-language models accurately localize attribute-specified objects. The project page can be found at https://qinying-liu.github.io/Tag-Align.
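
The training objective described in the abstract (an image-text contrastive loss complemented by a multi-tag classification loss) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy vocabulary `TAG_VOCAB`, the word-matching `parse_tags` stand-in for the automatic parsing pipeline, the binary cross-entropy form of the tag loss, the temperature, and the `tag_weight` factor.

```python
import numpy as np

# Hypothetical tag vocabulary (objects and attributes); the paper's real
# vocabulary and parser are not described in this abstract.
TAG_VOCAB = ["cat", "dog", "sofa", "black", "white"]


def parse_tags(caption, vocab=TAG_VOCAB):
    """Toy stand-in for the automatic parsing step.

    Returns a multi-hot vector marking which vocabulary words occur in the caption.
    """
    words = set(caption.lower().replace(",", " ").replace(".", " ").split())
    return np.array([1.0 if tag in words else 0.0 for tag in vocab])


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE-style) loss.

    Row i of `img` and row i of `txt` are assumed to be a matched pair.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)


def multi_tag_loss(tag_logits, tag_targets):
    """Per-tag binary cross-entropy on logits (an assumed form of the tag loss)."""
    x, y = tag_logits, tag_targets
    # Stable formulation: max(x, 0) - x*y + log(1 + exp(-|x|))
    return (np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))).mean()


def total_loss(img, txt, tag_logits, tag_targets, tag_weight=1.0):
    """Contrastive loss plus a weighted multi-tag classification loss."""
    return contrastive_loss(img, txt) + tag_weight * multi_tag_loss(tag_logits, tag_targets)


# Usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
captions = ["A black cat on a sofa.", "A white dog."]
targets = np.stack([parse_tags(c) for c in captions])
image_features = rng.normal(size=(2, 8))
text_features = rng.normal(size=(2, 8))
tag_logits = rng.normal(size=targets.shape)
loss = total_loss(image_features, text_features, tag_logits, targets)
```

In this sketch the tag targets come only from the paired caption, which mirrors the abstract's point that no data format beyond image-text pairs is needed.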

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 25.3 | TagAlign |
| Semantic Segmentation | COCO-Object | mIoU | 33.3 | TagAlign |
| Semantic Segmentation | ADE20K | Mean IoU (val) | 17.3 | TagAlign |
| Semantic Segmentation | Cityscapes val | mIoU | 27.5 | TagAlign |
| Semantic Segmentation | PASCAL Context-59 | mIoU | 37.6 | TagAlign |
| Semantic Segmentation | PascalVOC-20 | mIoU | 87.9 | TagAlign |
| Semantic Segmentation | PASCAL VOC | mIoU | 53.9 | TagAlign |
| Unsupervised Semantic Segmentation | COCO-Stuff-171 | mIoU | 25.3 | TagAlign |
| Unsupervised Semantic Segmentation | COCO-Object | mIoU | 33.3 | TagAlign |
| Unsupervised Semantic Segmentation | ADE20K | Mean IoU (val) | 17.3 | TagAlign |
| Unsupervised Semantic Segmentation | Cityscapes val | mIoU | 27.5 | TagAlign |
| Unsupervised Semantic Segmentation | PASCAL Context-59 | mIoU | 37.6 | TagAlign |
| Unsupervised Semantic Segmentation | PascalVOC-20 | mIoU | 87.9 | TagAlign |
| Unsupervised Semantic Segmentation | PASCAL VOC | mIoU | 53.9 | TagAlign |
| Open Vocabulary Semantic Segmentation | PascalVOC-20 | mIoU | 87.9 | TagAlign (trained with image-text pairs) |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | mIoU | 37.6 | TagAlign (trained with image-text pairs) |
| 10-shot image generation | COCO-Stuff-171 | mIoU | 25.3 | TagAlign |
| 10-shot image generation | COCO-Object | mIoU | 33.3 | TagAlign |
| 10-shot image generation | ADE20K | Mean IoU (val) | 17.3 | TagAlign |
| 10-shot image generation | Cityscapes val | mIoU | 27.5 | TagAlign |
| 10-shot image generation | PASCAL Context-59 | mIoU | 37.6 | TagAlign |
| 10-shot image generation | PascalVOC-20 | mIoU | 87.9 | TagAlign |
| 10-shot image generation | PASCAL VOC | mIoU | 53.9 | TagAlign |
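
Every result above is reported as mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch of the generic definition, computed on a single pair of label maps. The function name `mean_iou`, the `ignore_index` value of 255, and the choice to average only over classes that appear in the prediction or the ground truth are assumptions for illustration. Each benchmark has its own evaluation protocol (for example, whether a background class is counted), and benchmark scores are normally accumulated over a whole dataset rather than computed per image.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union between two integer label maps.

    Pixels whose ground-truth label equals `ignore_index` are skipped.
    Classes absent from both prediction and ground truth are left out of the mean.
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    # Confusion matrix: rows are ground-truth classes, columns are predicted classes.
    confusion = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0
    return (intersection[present] / union[present]).mean()


# Usage on a tiny 2x2 example with two classes.
prediction = np.array([[0, 0], [1, 1]])
ground_truth = np.array([[0, 1], [1, 1]])
score = mean_iou(prediction, ground_truth, num_classes=2)
```

The function returns a fraction between 0 and 1, whereas the table reports values as percentages.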

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Non-Adaptive Adversarial Face Generation (2025-07-16)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)