Artyom Gadetsky, Yulun Jiang, Maria Brbic
Foundation vision-language models have enabled remarkable zero-shot transferability of pre-trained representations to a wide range of downstream tasks. However, to solve a new task, zero-shot transfer still requires human guidance to define the visual categories that appear in the data. Here, we show that fully unsupervised transfer emerges when searching for the labeling of a dataset that induces maximal margin classifiers in the representation spaces of different foundation models. We present TURTLE, a fully unsupervised method that effectively employs this guiding principle to uncover the underlying labeling of a downstream dataset without any supervision or task-specific representation learning. We evaluate TURTLE on a diverse benchmark suite of 26 datasets and show that it achieves new state-of-the-art unsupervised performance. Furthermore, despite being fully unsupervised, TURTLE outperforms zero-shot transfer baselines on a wide range of datasets. In particular, TURTLE matches the average performance of CLIP zero-shot across the 26 datasets when using the same representation space, spanning a wide range of architectures and model sizes. By guiding the search for the underlying labeling with the representation spaces of two foundation models, TURTLE surpasses zero-shot transfer and unsupervised prompt tuning baselines, demonstrating the surprising power and effectiveness of unsupervised transfer.
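The guiding principle above can be illustrated with a toy sketch. This is not the authors' implementation (TURTLE optimizes task encoders with gradient-based bilevel training); here, as a hedged stand-in, random linear heads on the concatenated frozen embeddings propose candidate labelings, and a least-squares residual serves as a crude proxy for how separable each labeling is in every representation space. The names `turtle_style_search` and `margin_score` are illustrative, not from the paper.

```python
import numpy as np

def margin_score(reps, labels, k):
    """Proxy for separability: summed least-squares residual of a linear
    classifier fit to the labeling in each frozen representation space
    (lower = classes are more linearly separable in every space)."""
    onehot = np.eye(k)[labels]
    total = 0.0
    for X in reps:
        W, *_ = np.linalg.lstsq(X, onehot, rcond=None)  # fit linear head
        total += ((X @ W - onehot) ** 2).mean()
    return total

def turtle_style_search(reps, k, n_restarts=64, seed=0):
    """Toy search for a labeling that induces well-separated linear
    classifiers in several frozen representation spaces.

    reps : list of (n, d_i) arrays -- embeddings of the same n samples
           from different foundation models (e.g. CLIP and DINOv2).
    k    : number of classes to uncover.
    """
    rng = np.random.default_rng(seed)
    Z = np.hstack(reps)                       # concatenate the spaces
    n = Z.shape[0]
    best, best_score = None, np.inf
    for _ in range(n_restarts):
        W = rng.normal(size=(Z.shape[1], k))
        labels = (Z @ W).argmax(1)            # candidate labeling
        counts = np.bincount(labels, minlength=k)
        if counts.min() < max(1, n // (5 * k)):
            continue                          # skip near-degenerate labelings
        s = margin_score(reps, labels, k)
        if s < best_score:
            best, best_score = labels, s
    return best
```

A labeling that splits the data along a direction meaningful in only one space fits poorly in the other, so candidates that all spaces agree on win the search, which is the intuition behind using two foundation models.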
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Image Clustering | Stanford Cars | Accuracy | 64.6 | TURTLE (CLIP + DINOv2) |
| Image Clustering | Kinetics-700 | Accuracy | 43 | TURTLE (CLIP + DINOv2) |
| Image Clustering | PCam | Accuracy | 52 | TURTLE (CLIP + DINOv2) |
| Image Clustering | DTD | Accuracy | 57.3 | TURTLE (CLIP + DINOv2) |
| Image Clustering | GTSRB | Accuracy | 48.4 | TURTLE (CLIP + DINOv2) |
| Image Clustering | SUN397 | Accuracy | 67.9 | TURTLE (CLIP + DINOv2) |
| Image Clustering | EuroSAT | Accuracy | 96.6 | TURTLE (CLIP + DINOv2) |
| Image Clustering | CIFAR-10 | ARI | 98.9 | TURTLE (CLIP + DINOv2) |
| Image Clustering | CIFAR-10 | Accuracy | 99.5 | TURTLE (CLIP + DINOv2) |
| Image Clustering | CIFAR-10 | NMI | 98.5 | TURTLE (CLIP + DINOv2) |
| Image Clustering | Caltech-101 | Accuracy | 89.8 | TURTLE (CLIP + DINOv2) |
| Image Clustering | CLEVR Counts | Accuracy | 24 | TURTLE (CLIP + DINOv2) |
| Image Clustering | Hateful Memes | Accuracy | 54.2 | TURTLE (CLIP + DINOv2) |
| Image Clustering | KITTI | Accuracy | 39.4 | TURTLE (CLIP + DINOv2) |
| Image Clustering | CIFAR-100 | ARI | 83.4 | TURTLE (CLIP + DINOv2) |
| Image Clustering | CIFAR-100 | Accuracy | 89.8 | TURTLE (CLIP + DINOv2) |
| Image Clustering | CIFAR-100 | NMI | 91.5 | TURTLE (CLIP + DINOv2) |
| Image Clustering | UCF101 | Accuracy | 82.3 | TURTLE (CLIP + DINOv2) |
| Image Clustering | FGVC Aircraft | Accuracy | 36.5 | TURTLE (CLIP + DINOv2) |
| Image Clustering | MNIST | Accuracy | 97.8 | TURTLE (CLIP + DINOv2) |
| Image Clustering | Flowers-102 | Accuracy | 99.6 | TURTLE (CLIP + DINOv2) |
| Image Clustering | Birdsnap | Accuracy | 68.1 | TURTLE (CLIP + DINOv2) |
| Image Clustering | STL-10 | ARI | 99.4 | TURTLE (CLIP + DINOv2) |
| Image Clustering | STL-10 | Accuracy | 99.7 | TURTLE (CLIP + DINOv2) |
| Image Clustering | STL-10 | NMI | 99.3 | TURTLE (CLIP + DINOv2) |
| Image Clustering | Oxford-IIIT Pets | Accuracy | 92.3 | TURTLE (CLIP + DINOv2) |
| Image Clustering | ImageNet | ARI | 62.5 | TURTLE (CLIP + DINOv2) |
| Image Clustering | ImageNet | Accuracy | 72.9 | TURTLE (CLIP + DINOv2) |
| Image Clustering | ImageNet | NMI | 88.2 | TURTLE (CLIP + DINOv2) |
| Image Clustering | Country211 | Accuracy | 11.1 | TURTLE (CLIP + DINOv2) |
| Image Clustering | Rendered SST2 | Accuracy | 51.6 | TURTLE (CLIP + DINOv2) |
| Image Clustering | Food-101 | Accuracy | 92.2 | TURTLE (CLIP + DINOv2) |
| Image Clustering | FER2013 | Accuracy | 36.2 | TURTLE (CLIP + DINOv2) |
| Image Clustering | RESISC45 | Accuracy | 89.6 | TURTLE (CLIP + DINOv2) |
| Image Classification | STL-10 | Accuracy | 99.7 | TURTLE (CLIP + DINOv2) |
| Image Classification | CIFAR-10 | Accuracy | 99.5 | TURTLE (CLIP + DINOv2) |
| Image Classification | MNIST | Accuracy | 97.8 | TURTLE (CLIP + DINOv2) |
| Image Classification | ImageNet | Accuracy | 72.9 | TURTLE (CLIP + DINOv2) |