Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SPEED

SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings


Description

Monocular depth estimation (MDE) is the task of estimating depth from a single frame. This information is essential in many computer vision tasks, such as scene understanding and visual odometry, which are key components of autonomous and robotic systems. State-of-the-art approaches based on vision transformer architectures are extremely deep and complex, making them unsuitable for real-time inference on edge and autonomous systems with limited resources (e.g. robot indoor navigation and surveillance). This paper presents SPEED, a Separable Pyramidal pooling EncodEr-Decoder architecture designed to achieve real-time performance on multiple hardware platforms. The proposed model is a fast-throughput deep architecture for MDE that obtains accurate depth estimates from low-resolution images using minimal hardware resources (i.e. edge devices). The encoder-decoder model exploits two depthwise separable pyramidal pooling layers, which increase the inference frequency while reducing the overall computational complexity. The proposed method outperforms other fast-throughput architectures in both accuracy and frame rate, achieving real-time performance on a cloud CPU, a TPU, and the NVIDIA Jetson TX1 on two indoor benchmarks: the NYU Depth v2 and DIML Kinect v2 datasets.
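The complexity reduction claimed for the depthwise separable layers comes from factoring a standard k x k convolution into a per-channel (depthwise) k x k convolution followed by a 1x1 pointwise convolution. A minimal sketch of the parameter and multiply-accumulate counts (the layer sizes below are illustrative, not taken from the SPEED paper):

```python
def conv_cost(c_in, c_out, k, h, w):
    """Parameters and multiply-accumulates (MACs) of a standard k x k conv."""
    params = k * k * c_in * c_out
    macs = params * h * w  # one MAC per weight per output position
    return params, macs

def separable_conv_cost(c_in, c_out, k, h, w):
    """Cost of a depthwise k x k conv followed by a 1x1 pointwise conv."""
    dw_params = k * k * c_in   # one k x k filter per input channel
    pw_params = c_in * c_out   # 1x1 conv mixes channels
    params = dw_params + pw_params
    macs = params * h * w
    return params, macs

# Hypothetical 3x3 layer, 128 -> 128 channels, on a 60x80 feature map
std_p, std_m = conv_cost(128, 128, 3, 60, 80)
sep_p, sep_m = separable_conv_cost(128, 128, 3, 60, 80)
print(std_p, sep_p)      # 147456 vs 17536 parameters
print(std_m / sep_m)     # ~8.4x fewer MACs
```

The savings factor is roughly 1/c_out + 1/k^2, so wider layers benefit more; this is the standard argument behind depthwise separable designs and explains the higher frame rates reported on low-resource hardware.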

Papers Using This Method

Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Interpretable Bayesian Tensor Network Kernel Machines with Automatic Rank and Feature Selection (2025-07-15)
COLI: A Hierarchical Efficient Compressor for Large Images (2025-07-15)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
Streaming 4D Visual Geometry Transformer (2025-07-15)
Neurosymbolic Reasoning Shortcuts under the Independence Assumption (2025-07-15)
Federated Learning with Graph-Based Aggregation for Traffic Forecasting (2025-07-13)
Lizard: An Efficient Linearization Framework for Large Language Models (2025-07-11)
GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation (2025-07-10)
Robust One-step Speech Enhancement via Consistency Distillation (2025-07-08)
GSVR: 2D Gaussian-based Video Representation for 800+ FPS with Hybrid Deformation Field (2025-07-08)
Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study (2025-07-08)
Acquiring and Adapting Priors for Novel Tasks via Neural Meta-Architectures (2025-07-07)
MambaFusion: Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection (2025-07-06)
OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference (2025-07-05)
Hita: Holistic Tokenizer for Autoregressive Image Generation (2025-07-03)
CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation (2025-06-29)
Deterministic Object Pose Confidence Region Estimation (2025-06-28)
SAM4D: Segment Anything in Camera and LiDAR Streams (2025-06-26)
Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning (2025-06-26)