Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

2021-04-29 · ICCV 2021

Tasks: Self-Supervised Image Classification, Video Object Detection, Image Classification, Visual Place Recognition, Self-Supervised Learning, Semantic Segmentation, Copy Detection, Video Object Segmentation, Single-object discovery, Linear evaluation, Image Retrieval
Paper · PDF · Code (official)

Abstract

In this paper, we ask whether self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of the momentum encoder, multi-crop training, and the use of small patches with ViTs. We distill our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
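The self-distillation described above can be sketched in a few lines: a student network is trained to match the output distribution of a teacher whose weights are an exponential moving average (EMA, the "momentum encoder") of the student's, with the teacher's outputs centered and sharpened with a low temperature. The following is a minimal numpy sketch under stated assumptions; `dino_loss` and `ema_update` are illustrative names, not the authors' API, and the network forward passes are omitted.

```python
import numpy as np

def softmax(x, temp):
    # Temperature-scaled softmax along the last axis, numerically stabilized.
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    # Teacher targets: centered (to avoid collapse) and sharpened with a
    # low temperature. In training these targets carry no gradient.
    p_teacher = softmax(teacher_out - center, t_teacher)
    # Student predictions use a higher temperature.
    log_p_student = np.log(softmax(student_out, t_student) + 1e-12)
    # Cross-entropy between teacher targets and student predictions,
    # averaged over the batch.
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

def ema_update(teacher_w, student_w, momentum=0.996):
    # Momentum encoder: teacher weights track an exponential moving
    # average of the student weights.
    return momentum * teacher_w + (1.0 - momentum) * student_w
```

In the full method, the loss is computed across multi-crop views (global crops go through the teacher, all crops through the student), and the center itself is an EMA of teacher outputs; both details are elided here.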

Results

Task | Dataset | Metric | Value | Model
Video Object Segmentation | DAVIS 2017 | J&F | 71.4 | DINO (ViT-B/8, ImageNet retrain)
Image Retrieval | ROxford (Medium) | mAP | 51.5 | DINO
Image Retrieval | RParis (Medium) | mAP | 75.3 | DINO
Image Retrieval | RParis (Hard) | mAP | 51.6 | DINO
Image Retrieval | ROxford (Hard) | mAP | 24.3 | DINO
Visual Place Recognition | Nardo-Air R | Recall@1 | 84.51 | DINO
Visual Place Recognition | Oxford RobotCar Dataset | Recall@1 | 15.71 | DINO
Visual Place Recognition | Nardo-Air | Recall@1 | 57.75 | DINO
Visual Place Recognition | Mid-Atlantic Ridge | Recall@1 | 27.72 | DINO
Visual Place Recognition | St Lucia | Recall@1 | 45.22 | DINO
Visual Place Recognition | Hawkins | Recall@1 | 46.61 | DINO
Visual Place Recognition | Laurel Caverns | Recall@1 | 41.07 | DINO
Visual Place Recognition | Gardens Point | Recall@1 | 78.5 | DINO
Visual Place Recognition | Pittsburgh-30k-test | Recall@1 | 70.13 | DINO
Visual Place Recognition | VP-Air | Recall@1 | 24.02 | DINO
Visual Place Recognition | 17 Places | Recall@1 | 61.82 | DINO
Visual Place Recognition | Baidu Mall | Recall@1 | 48.3 | DINO
Image Classification | OmniBenchmark | Average Top-1 Accuracy | 38.9 | DINO

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Visual Place Recognition for Large-Scale UAV Applications (2025-07-20)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)