Yuheng Shi, Minjing Dong, Chang Xu
While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary prediction, its performance on semantic segmentation remains suboptimal. This shortfall stems primarily from its spatially invariant semantic features and its constrained resolution. While previous adaptations addressed spatial invariance by modifying the self-attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Different from previous segment-then-splice methods, which segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. In addition, we propose a refinement strategy for CLIP's coarse segmentation outputs that transforms them into prompts for SAM, further enhancing segmentation performance. Trident achieves a significant improvement in mIoU across eight benchmarks compared with the current SOTA, increasing from 44.4 to 48.6. Code is available at https://github.com/YuHengsss/Trident.
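The global-aggregation step described above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes spliced per-patch CLIP/DINO features and SAM encoder features are already available as arrays, and shows only the core idea: a patch-to-patch correlation (affinity) matrix built from SAM features is used to aggregate the spliced features across the whole image, broadening the receptive field beyond each sliding-window sub-image.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_aggregation(spliced_feats, sam_feats):
    """Aggregate spliced features with a SAM-derived correlation matrix.

    spliced_feats: (N, C) CLIP/DINO features spliced over all sub-image patches
    sam_feats:     (N, D) SAM encoder features for the same N patches
    Returns:       (N, C) globally aggregated features
    """
    # Patch-to-patch correlation over the full image, scaled like attention.
    corr = softmax(sam_feats @ sam_feats.T / np.sqrt(sam_feats.shape[1]))
    # Each output patch is a correlation-weighted mix of all spliced features.
    return corr @ spliced_feats

# Toy example: 6 patches, 4-dim spliced features, 8-dim SAM features.
rng = np.random.default_rng(0)
out = global_aggregation(rng.normal(size=(6, 4)), rng.normal(size=(6, 8)))
assert out.shape == (6, 4)
```

In the actual framework the aggregated features would then be matched against CLIP text embeddings to produce the coarse segmentation that is later refined via SAM prompts; the array shapes and function names here are illustrative assumptions.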
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | COCO-Stuff-171 | mIoU | 28.6 | Trident |
| Semantic Segmentation | COCO-Object | mIoU | 42.2 | Trident |
| Semantic Segmentation | ADE20K (val) | mIoU | 26.7 | Trident |
| Semantic Segmentation | Cityscapes (val) | mIoU | 47.6 | Trident |
| Semantic Segmentation | PASCAL Context-59 | mIoU | 44.3 | Trident |
| Semantic Segmentation | PASCAL Context-60 | mIoU | 40.1 | Trident |
| Semantic Segmentation | PASCAL VOC-20 | mIoU | 88.7 | Trident |
| Semantic Segmentation | PASCAL VOC | mIoU | 70.8 | Trident |