Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Attention Attention Everywhere: Monocular Depth Prediction with Skip Attention

Ashutosh Agarwal, Chetan Arora

2022-10-17 · Depth Prediction · Prediction · Depth Estimation · Monocular Depth Estimation

Paper · PDF · Code (official)

Abstract

Monocular Depth Estimation (MDE) aims to predict pixel-wise depth given a single RGB image. For both convolutional and recent attention-based models, encoder-decoder architectures have been found useful due to the simultaneous requirement of global context and pixel-level resolution. Typically, a skip connection module is used to fuse the encoder and decoder features, which comprises feature map concatenation followed by a convolution operation. Inspired by the demonstrated benefits of attention in a multitude of computer vision problems, we propose an attention-based fusion of encoder and decoder features. We pose MDE as a pixel query refinement problem, where coarsest-level encoder features are used to initialize pixel-level queries, which are then refined to higher resolutions by the proposed Skip Attention Module (SAM). We formulate the prediction problem as ordinal regression over the bin centers that discretize the continuous depth range and introduce a Bin Center Predictor (BCP) module that predicts bins at the coarsest level using pixel queries. Apart from the benefit of image-adaptive depth binning, the proposed design helps learn improved depth embeddings in the initial pixel queries via direct supervision from the ground truth. Extensive experiments on the two canonical datasets, NYUV2 and KITTI, show that our architecture outperforms the state-of-the-art by 5.3% and 3.9%, respectively, along with improved generalization performance by 9.4% on the SUNRGBD dataset. Code is available at https://github.com/ashutosh1807/PixelFormer.git.
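The two core ideas in the abstract can be sketched in a few lines: pixel queries refined by cross-attending to encoder skip features, and depth read out as a probability-weighted sum over predicted bin centers. The sketch below is a minimal, hypothetical illustration in numpy; the paper's actual SAM uses learned projections and windowed attention, and the function names here are ours, not the authors'.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def skip_attention(queries, skip_feats, d_k):
    # queries: (N, d) pixel queries from the decoder path
    # skip_feats: (M, d) encoder features at the matching resolution
    # Single-head scaled dot-product cross-attention without learned
    # projections -- a simplification of the paper's Skip Attention Module.
    scores = queries @ skip_feats.T / np.sqrt(d_k)   # (N, M)
    attn = softmax(scores, axis=-1)
    return attn @ skip_feats                         # refined queries, (N, d)

def depth_from_bins(logits, bin_centers):
    # Ordinal-regression-style readout: per-pixel softmax over B bins,
    # depth = probability-weighted sum of (image-adaptive) bin centers.
    probs = softmax(logits, axis=-1)                 # (N, B)
    return probs @ bin_centers                       # (N,)

rng = np.random.default_rng(0)
d, N, M, B = 16, 4, 9, 8
queries = rng.standard_normal((N, d))
skip_feats = rng.standard_normal((M, d))
refined = skip_attention(queries, skip_feats, d)
centers = np.linspace(0.5, 10.0, B)      # illustrative depth range in metres
depth = depth_from_bins(rng.standard_normal((N, B)), centers)
print(refined.shape, depth.shape)        # (4, 16) (4,)
```

Because the readout is a convex combination of bin centers, every predicted depth is guaranteed to lie inside the (adaptively predicted) depth range.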

Results

Task             | Dataset           | Metric                  | Value | Model
Depth Estimation | NYU-Depth V2      | Delta < 1.25            | 0.929 | PixelFormer
Depth Estimation | NYU-Depth V2      | Delta < 1.25^2          | 0.991 | PixelFormer
Depth Estimation | NYU-Depth V2      | Delta < 1.25^3          | 0.998 | PixelFormer
Depth Estimation | NYU-Depth V2      | RMSE                    | 0.322 | PixelFormer
Depth Estimation | NYU-Depth V2      | absolute relative error | 0.09  | PixelFormer
Depth Estimation | NYU-Depth V2      | log 10                  | 0.039 | PixelFormer
Depth Estimation | KITTI Eigen split | Delta < 1.25            | 0.976 | PixelFormer
Depth Estimation | KITTI Eigen split | Delta < 1.25^2          | 0.997 | PixelFormer
Depth Estimation | KITTI Eigen split | Delta < 1.25^3          | 0.999 | PixelFormer
Depth Estimation | KITTI Eigen split | RMSE                    | 2.081 | PixelFormer
Depth Estimation | KITTI Eigen split | RMSE log                | 0.077 | PixelFormer
Depth Estimation | KITTI Eigen split | Sq Rel                  | 0.149 | PixelFormer
Depth Estimation | KITTI Eigen split | absolute relative error | 0.051 | PixelFormer

(The same results are also indexed under the parent 3D task category for both datasets.)
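The metrics in the table follow the standard monocular-depth evaluation protocol (Eigen split conventions): threshold accuracies Delta < 1.25^k, absolute/squared relative error, RMSE, RMSE log, and log10 error. A hedged sketch of how these are typically computed from predicted and ground-truth depth maps (not the authors' evaluation script):

```python
import numpy as np

def depth_metrics(pred, gt):
    # pred, gt: positive depth arrays of the same shape (metres).
    # Threshold accuracy: fraction of pixels whose max(pred/gt, gt/pred)
    # falls below 1.25, 1.25^2, 1.25^3.
    ratio = np.maximum(pred / gt, gt / pred)
    m = {f"delta<1.25^{k}": float((ratio < 1.25 ** k).mean()) for k in (1, 2, 3)}
    m["abs_rel"] = float((np.abs(pred - gt) / gt).mean())
    m["sq_rel"] = float((((pred - gt) ** 2) / gt).mean())
    m["rmse"] = float(np.sqrt(((pred - gt) ** 2).mean()))
    m["rmse_log"] = float(np.sqrt(((np.log(pred) - np.log(gt)) ** 2).mean()))
    m["log10"] = float(np.abs(np.log10(pred) - np.log10(gt)).mean())
    return m

# Toy example with four pixels.
gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 1.9, 4.5, 7.0])
m = depth_metrics(pred, gt)
print(m["delta<1.25^1"], round(m["abs_rel"], 4))  # 1.0 0.1
```

Lower is better for the error metrics; higher is better for the Delta thresholds, which is why the table's Delta values approach 1.0.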

Related Papers

- Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction (2025-07-21)
- $S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
- Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
- MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network (2025-07-15)
- Generative Click-through Rate Prediction with Applications to Search Advertising (2025-07-15)
- Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation (2025-07-15)