Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski

Published 2023-04-14 · Tasks: Self-Supervised Image Classification, Image Classification, Visual Place Recognition, Domain Generalization, Semantic Segmentation, Depth Estimation, Fine-Grained Image Classification, Monocular Depth Estimation, Image Retrieval
Links: Paper · PDF · Code (one official implementation, many community implementations)

Abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.
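
The abstract's smaller models are obtained by distilling the 1B-parameter ViT-g/14 teacher. The paper's actual objective combines image- and patch-level self-supervised losses; purely as a toy illustration of feature distillation (not the paper's code, and all names here are illustrative), a cosine-similarity feature-matching loss can be sketched as:

```python
import numpy as np

def cosine_distill_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) between L2-normalised student and
    teacher embeddings; the teacher's features are treated as fixed targets."""
    s = student_feats / np.linalg.norm(student_feats, axis=1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# A student that reproduces the teacher's features exactly has zero loss.
x = np.random.default_rng(0).normal(size=(4, 8))
print(cosine_distill_loss(x, x))  # ~0.0
```

Minimising such a loss pushes each student embedding toward the direction of the corresponding frozen teacher embedding, which is the general idea behind distilling a large frozen backbone into smaller ones.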

Results

Depth Estimation

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| NYU-Depth V2 | RMSE | 0.279 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| NYU-Depth V2 | Abs Rel | 0.0907 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| NYU-Depth V2 | log10 | 0.0371 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| NYU-Depth V2 | δ < 1.25 | 0.9497 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| NYU-Depth V2 | δ < 1.25² | 0.996 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| NYU-Depth V2 | δ < 1.25³ | 0.9994 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| KITTI Eigen split | RMSE | 2.1128 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| KITTI Eigen split | RMSE log | 0.0882 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| KITTI Eigen split | Abs Rel | 0.0652 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| KITTI Eigen split | Sq Rel | 0.1797 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| KITTI Eigen split | δ < 1.25 | 0.968 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| KITTI Eigen split | δ < 1.25² | 0.997 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| KITTI Eigen split | δ < 1.25³ | 0.9993 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
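
The δ thresholds, RMSE, Abs Rel, Sq Rel, and log10 values above are the standard monocular-depth metrics. A minimal numpy sketch of how they are typically computed (not the paper's evaluation code):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth metrics, computed over valid (gt > 0) pixels."""
    pred, gt = pred[gt > 0], gt[gt > 0]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "sq_rel": float(np.mean((pred - gt) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "log10": float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
        "delta_1": float(np.mean(ratio < 1.25)),
        "delta_2": float(np.mean(ratio < 1.25 ** 2)),
        "delta_3": float(np.mean(ratio < 1.25 ** 3)),
    }

# Per-pixel ratios are 1.0, 1.1 and 1.33: two of three pixels satisfy δ < 1.25.
m = depth_metrics(np.array([1.0, 2.2, 3.0]), np.array([1.0, 2.0, 4.0]))
print(m["delta_1"])  # ~0.667
```

The δ metrics count the fraction of pixels whose prediction/ground-truth ratio falls below the threshold, so higher is better, while the error metrics (RMSE, Abs Rel, etc.) are lower-is-better.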

Domain Adaptation

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| ImageNet-C | mean Corruption Error (mCE) | 28.2 | DINOv2 (ViT-g/14, frozen model, linear eval) |
| ImageNet-C | mean Corruption Error (mCE) | 31.5 | DINOv2 (ViT-L/14, frozen model, linear eval) |
| ImageNet-C | mean Corruption Error (mCE) | 42.7 | DINOv2 (ViT-B/14, frozen model, linear eval) |
| ImageNet-C | mean Corruption Error (mCE) | 54.4 | DINOv2 (ViT-S/14, frozen model, linear eval) |
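
For context on the mCE numbers: ImageNet-C normalises a model's corruption errors by a baseline's (AlexNet in Hendrycks & Dietterich's original benchmark) before averaging, so lower is better and 100 matches the baseline. A sketch, assuming per-corruption error matrices of shape (n_corruptions, 5 severities):

```python
import numpy as np

def mean_corruption_error(model_err, baseline_err):
    """mCE: for each corruption type, sum the top-1 error over the five
    severity levels, normalise by the baseline model's summed error on the
    same corruption, then average over corruptions and report a percentage."""
    model_err = np.asarray(model_err, dtype=float)
    baseline_err = np.asarray(baseline_err, dtype=float)
    ce = model_err.sum(axis=1) / baseline_err.sum(axis=1)
    return float(100.0 * ce.mean())

# Toy numbers: a model with half the baseline's error on every corruption.
print(mean_corruption_error([[0.2] * 5, [0.3] * 5], [[0.4] * 5, [0.6] * 5]))  # 50.0
```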

Semantic Segmentation

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| Fine-Grained Grass Segmentation Dataset | mIoU | 47.57 | DINOv2 |
| ADE20K | Params (M) | 1080 | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) |
| ADE20K | Validation mIoU | 60.2 | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) |

Image Retrieval

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| AmsterTime | mAP | 50 | DINOv2 distilled (ViT-L/14 frozen) |
| AmsterTime | mAP | 46.7 | DINOv2 (ViT-g/14 frozen) |
| AmsterTime | mAP | 45.6 | DINOv2 distilled (ViT-B/14 frozen) |
| AmsterTime | mAP | 43.5 | DINOv2 distilled (ViT-S/14 frozen) |
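
The mAP values above average, over queries, the average precision of each ranked retrieval list. A small numpy sketch of binary-relevance AP and mAP (a generic formulation; benchmark-specific protocols can differ):

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query, given binary relevance of the ranked retrieval list."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    # Precision at each rank, counted only where a relevant item appears.
    precision_at_hit = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float(np.sum(precision_at_hit * rel) / rel.sum())

def mean_average_precision(rankings_per_query):
    return float(np.mean([average_precision(r) for r in rankings_per_query]))

# Relevant items retrieved at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6.
print(average_precision([1, 0, 1, 0]))  # ~0.833
```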

Visual Place Recognition

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| Nardo-Air R | Recall@1 | 71.83 | DINOv2 |
| Oxford RobotCar Dataset | Recall@1 | 39.79 | DINOv2 |
| Nardo-Air | Recall@1 | 73.24 | DINOv2 |
| Mid-Atlantic Ridge | Recall@1 | 24.75 | DINOv2 |
| St Lucia | Recall@1 | 78.62 | DINOv2 |
| Hawkins | Recall@1 | 27.97 | DINOv2 |
| Laurel Caverns | Recall@1 | 40.18 | DINOv2 |
| Gardens Point | Recall@1 | 71.5 | DINOv2 |
| Pittsburgh-30k-test | Recall@1 | 78.32 | DINOv2 |
| VP-Air | Recall@1 | 45.23 | DINOv2 |
| 17 Places | Recall@1 | 61.82 | DINOv2 |
| Baidu Mall | Recall@1 | 49.21 | DINOv2 |
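
Recall@1 in place recognition counts a query as correct when its single nearest database descriptor comes from the same place. Real VPR benchmarks usually define "same place" via a geographic distance threshold; the sketch below simplifies that to exact place-label matching (an assumption for illustration):

```python
import numpy as np

def recall_at_1(query_desc, db_desc, query_labels, db_labels):
    """Fraction of queries whose nearest database descriptor (by L2 distance)
    carries the same place label as the query."""
    q, d = np.asarray(query_desc), np.asarray(db_desc)
    dists = np.linalg.norm(q[:, None, :] - d[None, :, :], axis=-1)
    nearest = np.argmin(dists, axis=1)
    return float(np.mean(np.asarray(db_labels)[nearest] == np.asarray(query_labels)))

# Two queries, each nearest to a database image of the same place.
q = np.array([[0.0, 0.0], [5.0, 5.0]])
db = np.array([[0.1, 0.0], [5.0, 4.9], [9.0, 9.0]])
print(recall_at_1(q, db, [0, 1], [0, 1, 2]))  # 1.0
```

With frozen DINOv2 features, `query_desc` and `db_desc` would simply be the backbone's global image embeddings; no retrieval-specific fine-tuning is involved.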

Image Classification

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| CIFAR-10 | Percentage correct | 99.5 | DINOv2 (ViT-g/14, frozen model, linear eval) |
| Oxford-IIIT Pet Dataset | Accuracy | 96.7 | DINOv2 (ViT-g/14, frozen model, linear eval) |

Fine-Grained Image Classification

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| Oxford-IIIT Pet Dataset | Accuracy | 96.7 | DINOv2 (ViT-g/14, frozen model, linear eval) |

Domain Generalization

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| ImageNet-C | mean Corruption Error (mCE) | 28.2 | DINOv2 (ViT-g/14, frozen model, linear eval) |
| ImageNet-C | mean Corruption Error (mCE) | 31.5 | DINOv2 (ViT-L/14, frozen model, linear eval) |
| ImageNet-C | mean Corruption Error (mCE) | 42.7 | DINOv2 (ViT-B/14, frozen model, linear eval) |
| ImageNet-C | mean Corruption Error (mCE) | 54.4 | DINOv2 (ViT-S/14, frozen model, linear eval) |

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Visual Place Recognition for Large-Scale UAV Applications (2025-07-20)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization (2025-07-17)