UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

Bowen Shi, Peisen Zhao, Zichen Wang, Yuhang Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian, Xiaopeng Zhang

2024-01-12

Tasks: Panoptic Segmentation, Open Vocabulary Semantic Segmentation, Segmentation, Semantic Segmentation, Open Vocabulary Panoptic Segmentation, Retrieval


Abstract

Vision-language foundation models, represented by Contrastive Language-Image Pre-training (CLIP), have gained increasing attention for jointly understanding both visual and textual tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularity, encompassing image-level, region-level, and pixel-level captions and tags. Accordingly, we develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities across different levels of detail. With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We believe that UMG-CLIP represents a valuable advancement in vision-language foundation models. The code is available at https://github.com/lygsbw/UMG-CLIP.
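To make the idea of aligning both global and local representations with text more concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss applied at two granularities. It is not the authors' implementation: the function names, the `region_weight` parameter, the temperature value of 0.07, and the simple weighted sum of the two terms are all assumptions made for illustration, and the pixel-level term is omitted.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    # Mean cross-entropy where row i's correct class is column i.
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(visual, text, temperature=0.07):
    # Symmetric contrastive loss over N matched (visual, text) embedding pairs.
    # The temperature is an assumed value, not taken from the paper.
    v = [_normalize(x) for x in visual]
    t = [_normalize(x) for x in text]
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        _diagonal_cross_entropy(logits)
        + _diagonal_cross_entropy(logits_transposed)
    )


def multi_granularity_loss(image_pairs, region_pairs, region_weight=1.0):
    # Hypothetical combination: image-level term plus a weighted region-level term.
    image_term = contrastive_loss(*image_pairs)
    region_term = contrastive_loss(*region_pairs)
    return image_term + region_weight * region_term


# Toy embeddings: two visual vectors with matching and mismatched text vectors.
visual = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_shuffled = [[0.0, 1.0], [1.0, 0.0]]

aligned = contrastive_loss(visual, text_aligned)
shuffled = contrastive_loss(visual, text_shuffled)
combined = multi_granularity_loss((visual, text_aligned), (visual, text_aligned))
```

In this toy setup, correctly paired embeddings give a much lower loss than shuffled ones, which is the property the contrastive objective relies on. A real system would compute the region-level embeddings from cropped or pooled features and would likely use a tensor library rather than plain lists.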

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	COCO minival	AP	50.7	UMG-CLIP-E/14
Semantic Segmentation	COCO minival	PQ	59.5	UMG-CLIP-E/14
Semantic Segmentation	COCO minival	mIoU	69.7	UMG-CLIP-E/14
Semantic Segmentation	COCO minival	AP	49.7	UMG-CLIP-L/14
Semantic Segmentation	COCO minival	PQ	58.9	UMG-CLIP-L/14
Semantic Segmentation	COCO minival	mIoU	68.9	UMG-CLIP-L/14
Open Vocabulary Panoptic Segmentation	ADE20K	PQ	31.6	UMG-CLIP-E/14
Open Vocabulary Panoptic Segmentation	ADE20K	PQ	29.1	UMG-CLIP-L/14
Open Vocabulary Semantic Segmentation	ADE20K-847	mIoU	17.3	UMG-CLIP-E/14
Open Vocabulary Semantic Segmentation	ADE20K-847	mIoU	15.4	UMG-CLIP-L/14
Open Vocabulary Semantic Segmentation	PascalVOC-20b	mIoU	85.4	UMG-CLIP-E/14
Open Vocabulary Semantic Segmentation	PASCAL Context-459	mIoU	25.2	UMG-CLIP-E/14
Open Vocabulary Semantic Segmentation	PASCAL Context-459	mIoU	23.2	UMG-CLIP-L/14
Open Vocabulary Semantic Segmentation	PascalVOC-20	mIoU	97.9	UMG-CLIP-L/14
Open Vocabulary Semantic Segmentation	PASCAL Context-59	mIoU	61	UMG-CLIP-L/14
Open Vocabulary Semantic Segmentation	ADE20K-150	mIoU	38.2	UMG-CLIP-E/14
Open Vocabulary Semantic Segmentation	ADE20K-150	mIoU	36.1	UMG-CLIP-L/14
Panoptic Segmentation	COCO minival	AP	50.7	UMG-CLIP-E/14
Panoptic Segmentation	COCO minival	PQ	59.5	UMG-CLIP-E/14
Panoptic Segmentation	COCO minival	mIoU	69.7	UMG-CLIP-E/14
Panoptic Segmentation	COCO minival	AP	49.7	UMG-CLIP-L/14
Panoptic Segmentation	COCO minival	PQ	58.9	UMG-CLIP-L/14
Panoptic Segmentation	COCO minival	mIoU	68.9	UMG-CLIP-L/14
