DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.9497 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^2 | 0.996 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^3 | 0.9994 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | NYU-Depth V2 | RMSE | 0.279 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | NYU-Depth V2 | absolute relative error | 0.0907 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | NYU-Depth V2 | log 10 | 0.0371 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.968 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.997 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 0.9993 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | RMSE | 2.1128 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | RMSE log | 0.0882 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | Sq Rel | 0.1797 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Depth Estimation | KITTI Eigen split | absolute relative error | 0.0652 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) |
| Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 28.2 | DINOv2 (ViT-g/14, frozen model, linear eval) |
| Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 31.5 | DINOv2 (ViT-L/14, frozen model, linear eval) |
| Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 42.7 | DINOv2 (ViT-B/14, frozen model, linear eval) |
| Domain Generalization | ImageNet-C | mean Corruption Error (mCE) | 54.4 | DINOv2 (ViT-S/14, frozen model, linear eval) |
| Semantic Segmentation | Fine-Grained Grass Segmentation Dataset | mIoU | 47.57 | DINOv2 |
| Semantic Segmentation | ADE20K | Params (M) | 1080 | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) |
| Semantic Segmentation | ADE20K | Validation mIoU | 60.2 | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) |
| Image Retrieval | AmsterTime | mAP | 50 | DINOv2 distilled (ViT-L/14 frozen) |
| Image Retrieval | AmsterTime | mAP | 46.7 | DINOv2 (ViT-g/14 frozen) |
| Image Retrieval | AmsterTime | mAP | 45.6 | DINOv2 distilled (ViT-B/14 frozen) |
| Image Retrieval | AmsterTime | mAP | 43.5 | DINOv2 distilled (ViT-S/14 frozen) |
| Visual Place Recognition | Nardo-Air R | Recall@1 | 71.83 | DINOv2 |
| Visual Place Recognition | Oxford RobotCar Dataset | Recall@1 | 39.79 | DINOv2 |
| Visual Place Recognition | Nardo-Air | Recall@1 | 73.24 | DINOv2 |
| Visual Place Recognition | Mid-Atlantic Ridge | Recall@1 | 24.75 | DINOv2 |
| Visual Place Recognition | St Lucia | Recall@1 | 78.62 | DINOv2 |
| Visual Place Recognition | Hawkins | Recall@1 | 27.97 | DINOv2 |
| Visual Place Recognition | Laurel Caverns | Recall@1 | 40.18 | DINOv2 |
| Visual Place Recognition | Gardens Point | Recall@1 | 71.5 | DINOv2 |
| Visual Place Recognition | Pittsburgh-30k-test | Recall@1 | 78.32 | DINOv2 |
| Visual Place Recognition | VP-Air | Recall@1 | 45.23 | DINOv2 |
| Visual Place Recognition | 17 Places | Recall@1 | 61.82 | DINOv2 |
| Visual Place Recognition | Baidu Mall | Recall@1 | 49.21 | DINOv2 |
| Image Classification | CIFAR-10 | Percentage correct | 99.5 | DINOv2 (ViT-g/14, frozen model, linear eval) |
| Image Classification | Oxford-IIIT Pet Dataset | Accuracy | 96.7 | DINOv2 (ViT-g/14, frozen model, linear eval) |
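The depth-estimation rows above use the standard monocular depth metrics: threshold accuracy (fraction of pixels with max(pred/gt, gt/pred) below 1.25^k), RMSE, absolute relative error, and log10 error. A minimal sketch of how these metrics are computed, using hypothetical toy values rather than the actual NYU-Depth V2 / KITTI evaluation code:

```python
import math

def depth_metrics(pred, gt):
    """Standard monocular depth-estimation metrics as reported above.
    `pred` and `gt` are equal-length lists of positive depths (metres)."""
    n = len(gt)
    # per-pixel ratio used by the delta (threshold accuracy) metrics
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    delta1 = sum(r < 1.25 for r in ratios) / n
    delta2 = sum(r < 1.25 ** 2 for r in ratios) / n
    delta3 = sum(r < 1.25 ** 3 for r in ratios) / n
    # root mean squared error in metres
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # mean absolute relative error, normalized by ground truth
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    # mean absolute error in log10 space
    log10 = sum(abs(math.log10(p) - math.log10(g)) for p, g in zip(pred, gt)) / n
    return {"delta1": delta1, "delta2": delta2, "delta3": delta3,
            "rmse": rmse, "abs_rel": abs_rel, "log10": log10}

# toy example: predictions within 25% of ground truth all count toward delta1
gt = [1.0, 2.0, 4.0, 8.0]
pred = [1.1, 2.1, 3.8, 8.4]
m = depth_metrics(pred, gt)
print(m["delta1"])  # 1.0 — every ratio is below 1.25
```

Lower is better for RMSE, absolute relative error, and log10; higher is better for the delta metrics, which is why the table's delta values approach 1.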