Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley
The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this problem through a promptable, semantics-agnostic segmentation paradigm, yet it still requires manual visual prompts or complex, domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided, instead, with only a small set of reference images. Our key insight is to leverage the strong semantic priors learned by foundation models to identify corresponding regions between a reference and a target image. We find that these correspondences enable the automatic generation of instance-level segmentation masks for downstream tasks, and we instantiate our ideas via a multi-stage, training-free method comprising (1) memory bank construction, (2) representation aggregation, and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP) and PASCAL VOC Few-Shot (71.2% nAP50), and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
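The three stages named above can be sketched in a minimal, hedged form. The snippet below is an illustrative NumPy implementation, not the paper's actual code: it assumes reference images have already been encoded into per-class patch features by some foundation-model backbone (e.g. a ViT), builds a memory bank of unit-normalised class prototypes by mean aggregation, and matches target patch features to the bank by cosine similarity with a hypothetical confidence threshold. All function names and the threshold value are assumptions for illustration.

```python
import numpy as np

def build_memory_bank(ref_feats):
    """Stage 1+2 (sketch): aggregate reference patch features into class prototypes.

    ref_feats: dict mapping class name -> list of (N_i, D) arrays of
    foundation-model patch features extracted inside reference masks.
    Returns a dict mapping class name -> unit-normalised (D,) prototype.
    """
    bank = {}
    for cls, feats in ref_feats.items():
        stacked = np.concatenate(feats, axis=0)        # (sum_i N_i, D)
        proto = stacked.mean(axis=0)                   # simple mean aggregation
        bank[cls] = proto / np.linalg.norm(proto)      # unit-normalise
    return bank

def match_features(target_feats, bank, threshold=0.5):
    """Stage 3 (sketch): semantic-aware matching of target patches to the bank.

    target_feats: (P, D) patch features of the target image.
    Returns per-patch class labels (-1 = no match) and cosine confidences.
    """
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    classes = list(bank.keys())
    protos = np.stack([bank[c] for c in classes])      # (C, D)
    sims = t @ protos.T                                # cosine similarities (P, C)
    best = sims.argmax(axis=1)
    conf = sims.max(axis=1)
    labels = np.array(
        [classes[i] if c >= threshold else -1 for i, c in zip(best, conf)],
        dtype=object,
    )
    return labels, conf
```

In the full method, the matched high-confidence patches would then be converted into point or box prompts for SAM to produce instance masks; the threshold here stands in for whatever semantic-aware filtering the paper applies.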
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Few-Shot Object Detection | MS-COCO (1-shot) | AP | 26.5 | Training-free |
| Few-Shot Object Detection | MS-COCO (10-shot) | AP | 36.6 | Training-free |
| Few-Shot Object Detection | MS-COCO (30-shot) | AP | 36.8 | Training-free |
| Few-Shot Object Detection | ArTaxOr | mAP | 35.0 | Training-free (w/o FT) |
| Few-Shot Object Detection | NEU-DET | mAP | 5.5 | Training-free (w/o FT) |
| Few-Shot Object Detection | DIOR | mAP | 16.4 | Training-free (w/o FT) |
| Few-Shot Object Detection | Clipart1k | mAP | 25.9 | Training-free (w/o FT) |
| Few-Shot Object Detection | DeepFish | mAP | 29.6 | Training-free (w/o FT) |
| Few-Shot Object Detection | UODD | mAP | 16.0 | Training-free (w/o FT) |