Paul Engstler, Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina
Self-supervised learning (SSL) can be used to solve complex visual tasks without human labels. Self-supervised representations encode useful semantic information about images, and as a result, they have already been used for tasks such as unsupervised semantic segmentation. In this paper, we investigate self-supervised representations for instance segmentation without any manual annotations. We find that the features of different SSL methods vary in their level of instance-awareness. In particular, DINO features, which are known to be excellent semantic descriptors, lag behind MAE features in their ability to separate instances.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Unsupervised Instance Segmentation | COCO val2017 | AP | 5.2 | Self-Training (MAE) |
| Unsupervised Instance Segmentation | COCO val2017 | AP50 | 12.1 | Self-Training (MAE) |
| Unsupervised Instance Segmentation | COCO val2017 | AP75 | 3.7 | Self-Training (MAE) |
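
The three metrics in the table differ only in how strictly a predicted mask must overlap a ground-truth mask to count as correct. Under the standard COCO protocol, AP50 and AP75 use mask IoU thresholds of 0.5 and 0.75, and AP averages over thresholds from 0.5 to 0.95 in steps of 0.05. The sketch below illustrates only the thresholding step on a toy pair of masks. The helper names `mask_iou` and `is_true_positive` are hypothetical, and this is not the evaluation code used in the paper.

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as equal-sized nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Two empty masks have no overlap to measure; treat that case as IoU 0.
    return inter / union if union else 0.0


def is_true_positive(pred, gt, iou_threshold):
    """A prediction matches a ground-truth instance if IoU reaches the threshold."""
    return mask_iou(pred, gt) >= iou_threshold


# Toy example: both masks cover 8 pixels and share 6 of them.
gt = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
]
pred = [
    [0, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
]

iou = mask_iou(pred, gt)                      # 6 shared / 10 covered pixels
hit_at_50 = is_true_positive(pred, gt, 0.50)  # counts toward AP50
hit_at_75 = is_true_positive(pred, gt, 0.75)  # does not count toward AP75
```

A mask like this one, roughly right but shifted, is accepted at the 0.5 threshold and rejected at 0.75, which is why AP75 in the table is much lower than AP50.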