
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

Bowen Shi, Peisen Zhao, Zichen Wang, Yuhang Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian, Xiaopeng Zhang

2024-01-12
Tasks: Panoptic Segmentation, Open Vocabulary Semantic Segmentation, Segmentation, Semantic Segmentation, Open Vocabulary Panoptic Segmentation, Retrieval

Abstract

Vision-language foundation models, represented by Contrastive Language-Image Pre-training (CLIP), have gained increasing attention for jointly understanding both vision and textual tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularities, encompassing image-level, region-level as well as pixel-level captions and tags. Accordingly, we develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities across different levels of detail. With parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We believe that UMG-CLIP represents a valuable advancement in vision-language foundation models. The code is available at https://github.com/lygsbw/UMG-CLIP.
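The global image-text matching that the abstract attributes to CLIP-style training is commonly implemented as a symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. The block below is a minimal sketch of that general objective in plain Python, not the UMG-CLIP implementation. The function names and the temperature of 0.07 are assumptions made for illustration. In principle the same loss could be applied to pooled region features and region-level captions, but how UMG-CLIP combines its granularities is not specified on this page.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true pair."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    image_embs[k] and text_embs[k] are assumed to be a matching pair.
    The temperature value is an illustrative assumption.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view

    image_to_text = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    text_to_image = sum(cross_entropy_row(logits_t[k], k) for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    paired = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned pairs:", clip_contrastive_loss(paired, paired))
    print("swapped pairs:", clip_contrastive_loss(paired, swapped))
```

The loss is near zero when each image embedding is closest to its own caption and grows large when the pairing is swapped, which is the behaviour the objective is designed to reward.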

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | COCO minival | AP | 50.7 | UMG-CLIP-E/14
Semantic Segmentation | COCO minival | PQ | 59.5 | UMG-CLIP-E/14
Semantic Segmentation | COCO minival | mIoU | 69.7 | UMG-CLIP-E/14
Semantic Segmentation | COCO minival | AP | 49.7 | UMG-CLIP-L/14
Semantic Segmentation | COCO minival | PQ | 58.9 | UMG-CLIP-L/14
Semantic Segmentation | COCO minival | mIoU | 68.9 | UMG-CLIP-L/14
Open Vocabulary Panoptic Segmentation | ADE20K | PQ | 31.6 | UMG-CLIP-E/14
Open Vocabulary Panoptic Segmentation | ADE20K | PQ | 29.1 | UMG-CLIP-L/14
Open Vocabulary Semantic Segmentation | ADE20K-847 | mIoU | 17.3 | UMG-CLIP-E/14
Open Vocabulary Semantic Segmentation | ADE20K-847 | mIoU | 15.4 | UMG-CLIP-L/14
Open Vocabulary Semantic Segmentation | PascalVOC-20b | mIoU | 85.4 | UMG-CLIP-E/14
Open Vocabulary Semantic Segmentation | PASCAL Context-459 | mIoU | 25.2 | UMG-CLIP-E/14
Open Vocabulary Semantic Segmentation | PASCAL Context-459 | mIoU | 23.2 | UMG-CLIP-L/14
Open Vocabulary Semantic Segmentation | PascalVOC-20 | mIoU | 97.9 | UMG-CLIP-L/14
Open Vocabulary Semantic Segmentation | PASCAL Context-59 | mIoU | 61 | UMG-CLIP-L/14
Open Vocabulary Semantic Segmentation | ADE20K-150 | mIoU | 38.2 | UMG-CLIP-E/14
Open Vocabulary Semantic Segmentation | ADE20K-150 | mIoU | 36.1 | UMG-CLIP-L/14
10-shot image generation | COCO minival | AP | 50.7 | UMG-CLIP-E/14
10-shot image generation | COCO minival | PQ | 59.5 | UMG-CLIP-E/14
10-shot image generation | COCO minival | mIoU | 69.7 | UMG-CLIP-E/14
10-shot image generation | COCO minival | AP | 49.7 | UMG-CLIP-L/14
10-shot image generation | COCO minival | PQ | 58.9 | UMG-CLIP-L/14
10-shot image generation | COCO minival | mIoU | 68.9 | UMG-CLIP-L/14
Panoptic Segmentation | COCO minival | AP | 50.7 | UMG-CLIP-E/14
Panoptic Segmentation | COCO minival | PQ | 59.5 | UMG-CLIP-E/14
Panoptic Segmentation | COCO minival | mIoU | 69.7 | UMG-CLIP-E/14
Panoptic Segmentation | COCO minival | AP | 49.7 | UMG-CLIP-L/14
Panoptic Segmentation | COCO minival | PQ | 58.9 | UMG-CLIP-L/14
Panoptic Segmentation | COCO minival | mIoU | 68.9 | UMG-CLIP-L/14
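
For readers unfamiliar with the mIoU and PQ metrics reported above, the following sketch shows their standard definitions on toy inputs. It is an illustration only and not the evaluation code behind these results. The function names are hypothetical, labels are flat lists of class ids, and details such as ignored labels, segment matching, and averaging PQ over classes are left out. The values in the table appear to be these quantities expressed as percentages.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes that appear in neither labeling
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def panoptic_quality(matched_ious, num_false_pos, num_false_neg):
    """Panoptic quality for one class.

    matched_ious holds the IoU of each true-positive segment match
    (conventionally, matches with IoU above 0.5). Unmatched predicted and
    ground-truth segments count as false positives and false negatives.
    """
    num_true_pos = len(matched_ious)
    denom = num_true_pos + 0.5 * num_false_pos + 0.5 * num_false_neg
    return sum(matched_ious) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    print("mIoU:", mean_iou(pred, target, num_classes=2))
    print("PQ:", panoptic_quality([0.8, 0.6], num_false_pos=1, num_false_neg=1))
```

In the toy case, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. For PQ, the two matched segments contribute a summed IoU of 1.4 over a denominator of 3.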
