DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.9497 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^2 | 0.996 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^3 | 0.9994 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | NYU-Depth V2 | RMSE | 0.279 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | NYU-Depth V2 | absolute relative error | 0.0907 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | NYU-Depth V2 | log 10 | 0.0371 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.968 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.997 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 0.9993 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | RMSE | 2.1128 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | RMSE log | 0.0882 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | Sq Rel | 0.1797 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | absolute relative error | 0.0652 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 28.2 | DINOv2 (ViT-g/14, frozen model, linear eval) |
| Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 31.5 | DINOv2 (ViT-L/14, frozen model, linear eval) |
| Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 42.7 | DINOv2 (ViT-B/14, frozen model, linear eval) |
| Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 54.4 | DINOv2 (ViT-S/14, frozen model, linear eval) |
| Semantic Segmentation | Fine-Grained Grass Segmentation Dataset | mIoU | 47.57 | DINOv2 |
| Semantic Segmentation | ADE20K | Params (M) | 1080 | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) |
| Semantic Segmentation | ADE20K | Validation mIoU | 60.2 | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) |
| Image Retrieval | AmsterTime | mAP | 50 | DINOv2 distilled (ViT-L/14 frozen) |
| Image Retrieval | AmsterTime | mAP | 46.7 | DINOv2 (ViT-g/14 frozen) |
| Image Retrieval | AmsterTime | mAP | 45.6 | DINOv2 distilled (ViT-B/14 frozen) |
| Image Retrieval | AmsterTime | mAP | 43.5 | DINOv2 distilled (ViT-S/14 frozen) |
| Visual Place Recognition | Nardo-Air R | Recall@1 | 71.83 | DINOv2 |
| Visual Place Recognition | Oxford RobotCar Dataset | Recall@1 | 39.79 | DINOv2 |
| Visual Place Recognition | Nardo-Air | Recall@1 | 73.24 | DINOv2 |
| Visual Place Recognition | Mid-Atlantic Ridge | Recall@1 | 24.75 | DINOv2 |
| Visual Place Recognition | St Lucia | Recall@1 | 78.62 | DINOv2 |
| Visual Place Recognition | Hawkins | Recall@1 | 27.97 | DINOv2 |
| Visual Place Recognition | Laurel Caverns | Recall@1 | 40.18 | DINOv2 |
| Visual Place Recognition | Gardens Point | Recall@1 | 71.5 | DINOv2 |
| Visual Place Recognition | Pittsburgh-30k-test | Recall@1 | 78.32 | DINOv2 |
| Visual Place Recognition | VP-Air | Recall@1 | 45.23 | DINOv2 |
| Visual Place Recognition | 17 Places | Recall@1 | 61.82 | DINOv2 |
| Visual Place Recognition | Baidu Mall | Recall@1 | 49.21 | DINOv2 |
| Image Classification | CIFAR-10 | Percentage correct | 99.5 | DINOv2 (ViT-g/14, frozen model, linear eval) |
| Image Classification | Oxford-IIIT Pet Dataset | Accuracy | 96.7 | DINOv2 (ViT-g/14, frozen model, linear eval) |
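The depth-estimation rows above use the standard monocular depth metrics: threshold accuracy (fraction of pixels with max(pred/gt, gt/pred) below 1.25^k), RMSE, absolute relative error, and log10 error. A minimal sketch of how these metrics are computed, using hypothetical toy values rather than the actual NYU-Depth V2 / KITTI evaluation code:

```python
import math

def depth_metrics(pred, gt):
    """Standard monocular depth-estimation metrics as reported above.
    `pred` and `gt` are equal-length lists of positive depths (metres)."""
    n = len(gt)
    # per-pixel ratio used by the delta (threshold accuracy) metrics
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    delta1 = sum(r < 1.25 for r in ratios) / n
    delta2 = sum(r < 1.25 ** 2 for r in ratios) / n
    delta3 = sum(r < 1.25 ** 3 for r in ratios) / n
    # root mean squared error in metres
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # mean absolute relative error, normalized by ground truth
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    # mean absolute error in log10 space
    log10 = sum(abs(math.log10(p) - math.log10(g)) for p, g in zip(pred, gt)) / n
    return {"delta1": delta1, "delta2": delta2, "delta3": delta3,
            "rmse": rmse, "abs_rel": abs_rel, "log10": log10}

# toy example: predictions within 25% of ground truth all count toward delta1
gt = [1.0, 2.0, 4.0, 8.0]
pred = [1.1, 2.1, 3.8, 8.4]
m = depth_metrics(pred, gt)
print(m["delta1"])  # 1.0 — every ratio is below 1.25
```

Lower is better for RMSE, absolute relative error, and log10; higher is better for the delta metrics, which is why the table's delta values approach 1.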