Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Attention Attention Everywhere: Monocular Depth Prediction with Skip Attention

Ashutosh Agarwal, Chetan Arora

2022-10-17 · Depth Prediction · Prediction · Depth Estimation · Monocular Depth Estimation

Paper · PDF · Code (official)

Abstract

Monocular Depth Estimation (MDE) aims to predict pixel-wise depth given a single RGB image. For both convolutional and recent attention-based models, encoder-decoder architectures have been found useful due to the simultaneous requirement of global context and pixel-level resolution. Typically, a skip connection module is used to fuse the encoder and decoder features, which comprises feature map concatenation followed by a convolution operation. Inspired by the demonstrated benefits of attention in a multitude of computer vision problems, we propose an attention-based fusion of encoder and decoder features. We pose MDE as a pixel query refinement problem, where coarsest-level encoder features are used to initialize pixel-level queries, which are then refined to higher resolutions by the proposed Skip Attention Module (SAM). We formulate the prediction problem as ordinal regression over the bin centers that discretize the continuous depth range and introduce a Bin Center Predictor (BCP) module that predicts bins at the coarsest level using pixel queries. Apart from the benefit of image-adaptive depth binning, the proposed design helps learn improved depth embeddings in the initial pixel queries via direct supervision from the ground truth. Extensive experiments on the two canonical datasets, NYUV2 and KITTI, show that our architecture outperforms the state-of-the-art by 5.3% and 3.9%, respectively, along with improved generalization performance by 9.4% on the SUNRGBD dataset. Code is available at https://github.com/ashutosh1807/PixelFormer.git.
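The two core ideas in the abstract can be sketched in a few lines: pixel queries refined by cross-attending to encoder skip features, and depth read out as a probability-weighted sum over predicted bin centers. The sketch below is a minimal, hypothetical illustration in numpy; the paper's actual SAM uses learned projections and windowed attention, and the function names here are ours, not the authors'.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def skip_attention(queries, skip_feats, d_k):
    # queries: (N, d) pixel queries from the decoder path
    # skip_feats: (M, d) encoder features at the matching resolution
    # Single-head scaled dot-product cross-attention without learned
    # projections -- a simplification of the paper's Skip Attention Module.
    scores = queries @ skip_feats.T / np.sqrt(d_k)   # (N, M)
    attn = softmax(scores, axis=-1)
    return attn @ skip_feats                         # refined queries, (N, d)

def depth_from_bins(logits, bin_centers):
    # Ordinal-regression-style readout: per-pixel softmax over B bins,
    # depth = probability-weighted sum of (image-adaptive) bin centers.
    probs = softmax(logits, axis=-1)                 # (N, B)
    return probs @ bin_centers                       # (N,)

rng = np.random.default_rng(0)
d, N, M, B = 16, 4, 9, 8
queries = rng.standard_normal((N, d))
skip_feats = rng.standard_normal((M, d))
refined = skip_attention(queries, skip_feats, d)
centers = np.linspace(0.5, 10.0, B)      # illustrative depth range in metres
depth = depth_from_bins(rng.standard_normal((N, B)), centers)
print(refined.shape, depth.shape)        # (4, 16) (4,)
```

Because the readout is a convex combination of bin centers, every predicted depth is guaranteed to lie inside the (adaptively predicted) depth range.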

Results

Task             | Dataset           | Metric                  | Value | Model
Depth Estimation | NYU-Depth V2      | Delta < 1.25            | 0.929 | PixelFormer
Depth Estimation | NYU-Depth V2      | Delta < 1.25^2          | 0.991 | PixelFormer
Depth Estimation | NYU-Depth V2      | Delta < 1.25^3          | 0.998 | PixelFormer
Depth Estimation | NYU-Depth V2      | RMSE                    | 0.322 | PixelFormer
Depth Estimation | NYU-Depth V2      | absolute relative error | 0.09  | PixelFormer
Depth Estimation | NYU-Depth V2      | log 10                  | 0.039 | PixelFormer
Depth Estimation | KITTI Eigen split | Delta < 1.25            | 0.976 | PixelFormer
Depth Estimation | KITTI Eigen split | Delta < 1.25^2          | 0.997 | PixelFormer
Depth Estimation | KITTI Eigen split | Delta < 1.25^3          | 0.999 | PixelFormer
Depth Estimation | KITTI Eigen split | RMSE                    | 2.081 | PixelFormer
Depth Estimation | KITTI Eigen split | RMSE log                | 0.077 | PixelFormer
Depth Estimation | KITTI Eigen split | Sq Rel                  | 0.149 | PixelFormer
Depth Estimation | KITTI Eigen split | absolute relative error | 0.051 | PixelFormer

(The same results are also indexed under the parent 3D task category for both datasets.)
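The metrics in the table follow the standard monocular-depth evaluation protocol (Eigen split conventions): threshold accuracies Delta < 1.25^k, absolute/squared relative error, RMSE, RMSE log, and log10 error. A hedged sketch of how these are typically computed from predicted and ground-truth depth maps (not the authors' evaluation script):

```python
import numpy as np

def depth_metrics(pred, gt):
    # pred, gt: positive depth arrays of the same shape (metres).
    # Threshold accuracy: fraction of pixels whose max(pred/gt, gt/pred)
    # falls below 1.25, 1.25^2, 1.25^3.
    ratio = np.maximum(pred / gt, gt / pred)
    m = {f"delta<1.25^{k}": float((ratio < 1.25 ** k).mean()) for k in (1, 2, 3)}
    m["abs_rel"] = float((np.abs(pred - gt) / gt).mean())
    m["sq_rel"] = float((((pred - gt) ** 2) / gt).mean())
    m["rmse"] = float(np.sqrt(((pred - gt) ** 2).mean()))
    m["rmse_log"] = float(np.sqrt(((np.log(pred) - np.log(gt)) ** 2).mean()))
    m["log10"] = float(np.abs(np.log10(pred) - np.log10(gt)).mean())
    return m

# Toy example with four pixels.
gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 1.9, 4.5, 7.0])
m = depth_metrics(pred, gt)
print(m["delta<1.25^1"], round(m["abs_rel"], 4))  # 1.0 0.1
```

Lower is better for the error metrics; higher is better for the Delta thresholds, which is why the table's Delta values approach 1.0.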

Related Papers

- Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction (2025-07-21)
- $S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
- Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
- MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network (2025-07-15)
- Generative Click-through Rate Prediction with Applications to Search Advertising (2025-07-15)
- Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation (2025-07-15)