Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

2021-04-29 · ICCV 2021

Tasks: Self-Supervised Image Classification, Video Object Detection, Image Classification, Visual Place Recognition, Self-Supervised Learning, Semantic Segmentation, Copy Detection, Video Object Segmentation, Single-object discovery, Linear evaluation, Image Retrieval
Paper · PDF · Code (official)

Abstract

In this paper, we ask whether self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of the momentum encoder, multi-crop training, and the use of small patches with ViTs. We distill our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
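The self-distillation described above can be sketched in a few lines: a student network is trained to match the output distribution of a teacher whose weights are an exponential moving average (EMA, the "momentum encoder") of the student's, with the teacher's outputs centered and sharpened with a low temperature. The following is a minimal numpy sketch under stated assumptions; `dino_loss` and `ema_update` are illustrative names, not the authors' API, and the network forward passes are omitted.

```python
import numpy as np

def softmax(x, temp):
    # Temperature-scaled softmax along the last axis, numerically stabilized.
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    # Teacher targets: centered (to avoid collapse) and sharpened with a
    # low temperature. In training these targets carry no gradient.
    p_teacher = softmax(teacher_out - center, t_teacher)
    # Student predictions use a higher temperature.
    log_p_student = np.log(softmax(student_out, t_student) + 1e-12)
    # Cross-entropy between teacher targets and student predictions,
    # averaged over the batch.
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

def ema_update(teacher_w, student_w, momentum=0.996):
    # Momentum encoder: teacher weights track an exponential moving
    # average of the student weights.
    return momentum * teacher_w + (1.0 - momentum) * student_w
```

In the full method, the loss is computed across multi-crop views (global crops go through the teacher, all crops through the student), and the center itself is an EMA of teacher outputs; both details are elided here.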

Results

Task | Dataset | Metric | Value | Model
Video Object Segmentation | DAVIS 2017 | J&F | 71.4 | DINO (ViT-B/8, ImageNet retrain)
Image Retrieval | ROxford (Medium) | mAP | 51.5 | DINO
Image Retrieval | RParis (Medium) | mAP | 75.3 | DINO
Image Retrieval | RParis (Hard) | mAP | 51.6 | DINO
Image Retrieval | ROxford (Hard) | mAP | 24.3 | DINO
Visual Place Recognition | Nardo-Air R | Recall@1 | 84.51 | DINO
Visual Place Recognition | Oxford RobotCar Dataset | Recall@1 | 15.71 | DINO
Visual Place Recognition | Nardo-Air | Recall@1 | 57.75 | DINO
Visual Place Recognition | Mid-Atlantic Ridge | Recall@1 | 27.72 | DINO
Visual Place Recognition | St Lucia | Recall@1 | 45.22 | DINO
Visual Place Recognition | Hawkins | Recall@1 | 46.61 | DINO
Visual Place Recognition | Laurel Caverns | Recall@1 | 41.07 | DINO
Visual Place Recognition | Gardens Point | Recall@1 | 78.5 | DINO
Visual Place Recognition | Pittsburgh-30k-test | Recall@1 | 70.13 | DINO
Visual Place Recognition | VP-Air | Recall@1 | 24.02 | DINO
Visual Place Recognition | 17 Places | Recall@1 | 61.82 | DINO
Visual Place Recognition | Baidu Mall | Recall@1 | 48.3 | DINO
Image Classification | OmniBenchmark | Average Top-1 Accuracy | 38.9 | DINO

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Visual Place Recognition for Large-Scale UAV Applications (2025-07-20)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)