Vision Transformers for Dense Prediction

René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

2021-03-24 | ICCV 2021 | Tasks: Semantic Segmentation, Prediction, Depth Estimation, Monocular Depth Estimation

Abstract

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
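
The abstract's step of assembling tokens into image-like representations can be illustrated with a small sketch. It is not the authors' implementation: the function names, the row-major token order, the patch size in the example, and the nearest-neighbour upsampling (a stand-in for whatever learned resampling the model uses) are all assumptions made for illustration, and any special non-patch token is assumed to have been removed beforehand.

```python
def tokens_to_grid(tokens, image_height, image_width, patch_size):
    """Arrange a flat, row-major list of patch tokens into a 2D grid.

    Hypothetical helper: each token is a feature vector (a list of floats)
    describing one patch_size x patch_size patch of the input image.
    """
    if image_height % patch_size or image_width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    grid_h = image_height // patch_size
    grid_w = image_width // patch_size
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count does not match the patch grid")
    # Slice the flat sequence into grid_h rows of grid_w tokens each.
    return [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


def upsample_nearest(grid, factor):
    """Repeat every token factor x factor times (nearest-neighbour).

    Assumption: a simple stand-in for the resampling that produces
    image-like representations at different resolutions.
    """
    out = []
    for row in grid:
        wide = [tok for tok in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# A 64x64 image with 16x16 patches yields a 4x4 grid of 16 tokens.
tokens = [[float(i)] for i in range(16)]
grid = tokens_to_grid(tokens, 64, 64, 16)
finer = upsample_nearest(grid, 2)  # 8x8 image-like representation
```

In the full architecture, grids like these would be produced from several transformer stages and fused by the convolutional decoder; the sketch covers only the reshaping idea.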

Results

Task	Dataset	Metric	Value	Model
Depth Estimation	NYU-Depth V2	Delta < 1.25	0.904	DPT-Hybrid
Depth Estimation	NYU-Depth V2	Delta < 1.25^2	0.988	DPT-Hybrid
Depth Estimation	NYU-Depth V2	Delta < 1.25^3	0.994	DPT-Hybrid
Depth Estimation	NYU-Depth V2	RMSE	0.357	DPT-Hybrid
Depth Estimation	NYU-Depth V2	absolute relative error	0.11	DPT-Hybrid
Depth Estimation	NYU-Depth V2	log 10	0.045	DPT-Hybrid
Depth Estimation	ETH3D	Delta < 1.25	0.0946	DPT
Depth Estimation	ETH3D	absolute relative error	0.078	DPT
Depth Estimation	KITTI Eigen split	Delta < 1.25	0.959	DPT-Hybrid
Depth Estimation	KITTI Eigen split	Delta < 1.25^2	0.995	DPT-Hybrid
Depth Estimation	KITTI Eigen split	Delta < 1.25^3	0.999	DPT-Hybrid
Depth Estimation	KITTI Eigen split	RMSE	2.573	DPT-Hybrid
Depth Estimation	KITTI Eigen split	RMSE log	0.092	DPT-Hybrid
Depth Estimation	KITTI Eigen split	absolute relative error	0.062	DPT-Hybrid
Semantic Segmentation	ADE20K val	Pixel Accuracy	83.11	DPT-Hybrid
Semantic Segmentation	ADE20K val	mIoU	49.02	DPT-Hybrid
Semantic Segmentation	PASCAL Context	mIoU	60.46	DPT-Hybrid
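
For readers unfamiliar with the depth metrics in the table, the sketch below computes them using their conventional definitions in the monocular depth literature. It assumes those are the definitions behind the reported values; the function name `depth_metrics` and the dictionary keys are illustrative and do not come from the paper's code.

```python
import math


def depth_metrics(pred, gt):
    """Conventional depth metrics over paired, strictly positive depths."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and equally long")
    n = len(pred)
    pairs = list(zip(pred, gt))
    # Threshold accuracy uses the symmetric ratio max(pred/gt, gt/pred).
    ratios = [max(p / g, g / p) for p, g in pairs]
    return {
        "delta1": sum(r < 1.25 for r in ratios) / n,
        "delta2": sum(r < 1.25 ** 2 for r in ratios) / n,
        "delta3": sum(r < 1.25 ** 3 for r in ratios) / n,
        "rmse": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
        "rmse_log": math.sqrt(
            sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
        ),
        "abs_rel": sum(abs(p - g) / g for p, g in pairs) / n,
        "log10": sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n,
    }


# Toy example: one of four pixels is predicted at twice its true depth.
metrics = depth_metrics([1.0, 2.0, 4.0, 10.0], [1.0, 2.0, 2.0, 10.0])
```

Under these definitions, higher is better for the Delta thresholds, and lower is better for RMSE, RMSE log, absolute relative error, and log 10.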
