Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation

Rajeev Yasarla, Hong Cai, Jisoo Jeong, Yunxiao Shi, Risheek Garrepalli, Fatih Porikli

2023-07-26 · IEEE/CVF International Conference on Computer Vision (ICCV) 2023 · Depth Prediction · Depth Estimation · Monocular Depth Estimation

Paper · PDF

Abstract

We propose MAMo, a novel memory and attention framework for monocular video depth estimation. MAMo can augment any single-image depth estimation network into a video depth estimation model, enabling it to take advantage of temporal information to predict more accurate depth. In MAMo, we augment the model with memory, which aids depth prediction as the model streams through the video. Specifically, the memory stores learned visual and displacement tokens from previous time instances. This allows the depth network to cross-reference relevant features from the past when predicting depth on the current frame. We introduce a novel scheme to continuously update the memory, optimizing it to keep tokens that correspond to both past and present visual information. We adopt an attention-based approach to process memory features: we first learn the spatio-temporal relations among the visual and displacement memory tokens using a self-attention module, then aggregate the output features of self-attention with the current visual features through cross-attention. The cross-attended features are finally given to a decoder to predict depth on the current frame. Through extensive experiments on several benchmarks, including KITTI, NYU-Depth V2, and DDAD, we show that MAMo consistently improves monocular depth estimation networks and sets new state-of-the-art (SOTA) accuracy. Notably, our MAMo video depth estimation provides higher accuracy with lower latency compared to SOTA cost-volume-based video depth models.
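The attention pipeline the abstract describes (self-attention over the stored memory tokens, then cross-attention from the current frame's features into the attended memory) can be sketched in a few lines. This is a simplified illustration with made-up token shapes, no learned projections, and a hypothetical function name, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over rows of tokens.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def mamo_step(curr_feats, memory_tokens):
    """One depth-prediction step in the spirit of MAMo (simplified sketch):
    self-attention relates the memory tokens to each other, then
    cross-attention aggregates them with the current frame's features."""
    mem = attention(memory_tokens, memory_tokens, memory_tokens)  # self-attention
    fused = attention(curr_feats, mem, mem)                       # cross-attention
    return fused  # in the paper this would go to the depth decoder

# Toy shapes (hypothetical): 8 memory tokens, 4 current-feature tokens, dim 16.
rng = np.random.default_rng(0)
memory = rng.standard_normal((8, 16))
current = rng.standard_normal((4, 16))
out = mamo_step(current, memory)
```

The cross-attended output keeps the shape of the current features, so it can replace them in an existing single-image decoder, which is how the framework can wrap any single-image depth network.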

Results

Task | Dataset | Metric | Value | Model
Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.977 | MAMo
Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.998 | MAMo
Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 0.9995 | MAMo
Depth Estimation | KITTI Eigen split | RMSE | 1.984 | MAMo
Depth Estimation | KITTI Eigen split | RMSE log | 0.072 | MAMo
Depth Estimation | KITTI Eigen split | Sq Rel | 0.13 | MAMo
Depth Estimation | KITTI Eigen split | absolute relative error | 0.049 | MAMo
3D | KITTI Eigen split | Delta < 1.25 | 0.977 | MAMo
3D | KITTI Eigen split | Delta < 1.25^2 | 0.998 | MAMo
3D | KITTI Eigen split | Delta < 1.25^3 | 0.9995 | MAMo
3D | KITTI Eigen split | RMSE | 1.984 | MAMo
3D | KITTI Eigen split | RMSE log | 0.072 | MAMo
3D | KITTI Eigen split | Sq Rel | 0.13 | MAMo
3D | KITTI Eigen split | absolute relative error | 0.049 | MAMo
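The metrics in the table are the standard monocular depth-estimation measures used on the KITTI Eigen split. As a reference, a stdlib-only sketch of how they are computed (function name and sample values are illustrative):

```python
import math

def depth_metrics(pred, gt):
    """Standard monocular depth metrics (Delta accuracies and error terms).
    pred, gt: sequences of positive depths; inputs here are illustrative."""
    n = len(gt)
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    return {
        # Fraction of pixels whose depth ratio is within the threshold.
        "Delta < 1.25":   sum(r < 1.25 for r in ratios) / n,
        "Delta < 1.25^2": sum(r < 1.25 ** 2 for r in ratios) / n,
        "Delta < 1.25^3": sum(r < 1.25 ** 3 for r in ratios) / n,
        # Error terms: lower is better.
        "Abs Rel":  sum(abs(p - g) / g for p, g in zip(pred, gt)) / n,
        "Sq Rel":   sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n,
        "RMSE":     math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n),
        "RMSE log": math.sqrt(sum((math.log(p) - math.log(g)) ** 2
                                  for p, g in zip(pred, gt)) / n),
    }

# A perfect prediction scores 1.0 on the Delta metrics and 0.0 on the errors.
m = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 4.0])
```

Note the asymmetry in conventions: the Delta accuracies are "higher is better" (1.0 is perfect), while Abs Rel, Sq Rel, RMSE, and RMSE log are "lower is better".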

Related Papers

$S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network (2025-07-15)
Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation (2025-07-15)
Cameras as Relative Positional Encoding (2025-07-14)
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way (2025-07-11)