Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

Ashutosh Agarwal, Chetan Arora

2022-07-10

Depth Prediction, Semantic Segmentation, Depth Estimation, Monocular Depth Estimation

Abstract

Attention-based models such as transformers have shown outstanding performance on dense prediction tasks, such as semantic segmentation, owing to their capability of capturing long-range dependencies in an image. However, the benefit of transformers for monocular depth prediction has seldom been explored so far. This paper benchmarks various transformer-based models for the depth estimation task on the indoor NYUV2 dataset and the outdoor KITTI dataset. We propose a novel attention-based architecture, Depthformer, for monocular depth estimation. It uses multi-head self-attention to produce multiscale feature maps, which are combined by our proposed decoder network. We also propose a Transbins module that divides the depth range into bins whose center values are estimated adaptively per image. The final estimated depth at each pixel is a linear combination of the bin centers. The Transbins module takes advantage of the global receptive field provided by the transformer module in the encoding stage. Experimental results on the NYUV2 and KITTI depth estimation benchmarks demonstrate that our proposed method improves on the state of the art by 3.3% and 3.3%, respectively, in terms of Root Mean Squared Error (RMSE). Code is available at https://github.com/ashutosh1807/Depthformer.git.
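The bin-based depth readout described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' code. It assumes that the network predicts positive bin widths per image, that the widths are normalized to span a fixed depth range, and that per-pixel weights come from a softmax over bins. The names `bin_centers`, `pixel_depth`, `d_min`, and `d_max` are hypothetical.

```python
import math

def bin_centers(widths, d_min, d_max):
    """Turn K positive bin widths into K bin centers spanning [d_min, d_max].

    Assumption: widths are normalized so the bins tile the depth range.
    """
    total = sum(widths)
    centers, left = [], d_min
    for w in widths:
        span = (d_max - d_min) * w / total  # absolute width of this bin
        centers.append(left + span / 2.0)   # center is the bin midpoint
        left += span
    return centers

def pixel_depth(logits, centers):
    """Depth of one pixel: softmax over bins, then a weighted sum of centers."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return sum(e / z * c for e, c in zip(exps, centers))

# Three bins over a 0 to 4 range; the last bin is twice as wide as the others.
centers = bin_centers([1.0, 1.0, 2.0], d_min=0.0, d_max=4.0)
print(centers)  # [0.5, 1.5, 3.0]

# Equal logits give equal weights, so the depth is the mean of the centers.
print(pixel_depth([0.0, 0.0, 0.0], centers))
```

Because the centers are recomputed for every image, the same per-pixel weights can map to different depths in scenes with different depth distributions, which is the point of estimating the bins adaptively.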

Results

Task	Dataset	Metric	Value	Model
Depth Estimation	NYU-Depth V2	Delta < 1.25	0.913	Depthformer
Depth Estimation	NYU-Depth V2	Delta < 1.25^2	0.988	Depthformer
Depth Estimation	NYU-Depth V2	Delta < 1.25^3	0.997	Depthformer
Depth Estimation	NYU-Depth V2	RMSE	0.345	Depthformer
Depth Estimation	NYU-Depth V2	absolute relative error	0.1	Depthformer
Depth Estimation	NYU-Depth V2	log 10	0.042	Depthformer
Depth Estimation	KITTI Eigen split	Delta < 1.25	0.967	Depthformer
Depth Estimation	KITTI Eigen split	Delta < 1.25^2	0.996	Depthformer
Depth Estimation	KITTI Eigen split	Delta < 1.25^3	0.999	Depthformer
Depth Estimation	KITTI Eigen split	RMSE	2.285	Depthformer
Depth Estimation	KITTI Eigen split	RMSE log	0.087	Depthformer
Depth Estimation	KITTI Eigen split	Sq Rel	0.187	Depthformer
Depth Estimation	KITTI Eigen split	absolute relative error	0.058	Depthformer
3D	NYU-Depth V2	Delta < 1.25	0.913	Depthformer
3D	NYU-Depth V2	Delta < 1.25^2	0.988	Depthformer
3D	NYU-Depth V2	Delta < 1.25^3	0.997	Depthformer
3D	NYU-Depth V2	RMSE	0.345	Depthformer
3D	NYU-Depth V2	absolute relative error	0.1	Depthformer
3D	NYU-Depth V2	log 10	0.042	Depthformer
3D	KITTI Eigen split	Delta < 1.25	0.967	Depthformer
3D	KITTI Eigen split	Delta < 1.25^2	0.996	Depthformer
3D	KITTI Eigen split	Delta < 1.25^3	0.999	Depthformer
3D	KITTI Eigen split	RMSE	2.285	Depthformer
3D	KITTI Eigen split	RMSE log	0.087	Depthformer
3D	KITTI Eigen split	Sq Rel	0.187	Depthformer
3D	KITTI Eigen split	absolute relative error	0.058	Depthformer
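The metric names in the table are not defined on this page. The sketch below uses the definitions commonly adopted for monocular depth benchmarks; whether the paper's evaluation matches them in every detail (valid-pixel masks, depth caps, crops) is an assumption. The function name `depth_metrics` and the dictionary keys are hypothetical.

```python
import math

def depth_metrics(pred, gt):
    """Compute common depth metrics over paired positive depth values."""
    n = len(gt)
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    sq_err = [(p - g) ** 2 for p, g in zip(pred, gt)]
    return {
        # Fraction of pixels whose ratio to ground truth is under the threshold.
        "delta1": sum(r < 1.25 for r in ratios) / n,
        "delta2": sum(r < 1.25 ** 2 for r in ratios) / n,
        "delta3": sum(r < 1.25 ** 3 for r in ratios) / n,
        "rmse": math.sqrt(sum(sq_err) / n),
        "rmse_log": math.sqrt(
            sum((math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)) / n
        ),
        "abs_rel": sum(abs(p - g) / g for p, g in zip(pred, gt)) / n,
        "sq_rel": sum(e / g for e, g in zip(sq_err, gt)) / n,
        "log10": sum(abs(math.log10(p) - math.log10(g)) for p, g in zip(pred, gt)) / n,
    }

# Toy example: two exact predictions and one that is off by a factor of two.
metrics = depth_metrics(pred=[1.0, 2.0, 4.0], gt=[1.0, 2.0, 2.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Higher is better for the three delta accuracies; lower is better for the error metrics.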
