Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction

Daniel Gehrig, Michelle Rüegg, Mathias Gehrig, Javier Hidalgo-Carrió, Davide Scaramuzza

2021-02-18 · Depth Prediction · Depth Estimation · Object Detection · Monocular Depth Estimation

Paper · PDF · Code (official)

Abstract

Event cameras are novel vision sensors that report per-pixel brightness changes as a stream of asynchronous "events". They offer significant advantages compared to standard cameras due to their high temporal resolution, high dynamic range and lack of motion blur. However, events only measure the varying component of the visual signal, which limits their ability to encode scene context. By contrast, standard cameras measure absolute intensity frames, which capture a much richer representation of the scene. Both sensors are thus complementary. However, due to the asynchronous nature of events, combining them with synchronous images remains challenging, especially for learning-based methods. This is because traditional recurrent neural networks (RNNs) are not designed for asynchronous and irregular data from additional sensors. To address this challenge, we introduce Recurrent Asynchronous Multimodal (RAM) networks, which generalize traditional RNNs to handle asynchronous and irregular data from multiple sensors. Inspired by traditional RNNs, RAM networks maintain a hidden state that is updated asynchronously and can be queried at any time to generate a prediction. We apply this novel architecture to monocular depth estimation with events and frames where we show an improvement over state-of-the-art methods by up to 30% in terms of mean absolute depth error. To enable further research on multimodal learning with events, we release EventScape, a new dataset with events, intensity frames, semantic labels, and depth maps recorded in the CARLA simulator.
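
The core idea described above, a shared hidden state that is updated whenever either sensor delivers data and can be decoded into a prediction at any time, can be illustrated with a short sketch. The code below is not the authors' RAMNet implementation; the class name `AsyncRecurrentFusion`, the layer sizes, and the simplified ConvGRU-style gating are hypothetical placeholders chosen only to show the asynchronous update/query pattern the abstract describes.

```python
# Minimal sketch (hypothetical names/sizes): per-modality encoders write into one
# shared recurrent state; the state can be queried for a prediction at any time.
import torch
import torch.nn as nn

class AsyncRecurrentFusion(nn.Module):
    """Toy recurrent multimodal fusion: each modality has its own encoder, but
    both update the same hidden state through a shared, simplified GRU-style gate."""
    def __init__(self, event_channels=5, frame_channels=3, hidden=64):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "events": nn.Conv2d(event_channels, hidden, 3, padding=1),
            "frames": nn.Conv2d(frame_channels, hidden, 3, padding=1),
        })
        self.update_gate = nn.Conv2d(2 * hidden, hidden, 3, padding=1)
        self.candidate = nn.Conv2d(2 * hidden, hidden, 3, padding=1)
        self.decoder = nn.Conv2d(hidden, 1, 3, padding=1)  # e.g. a log-depth map

    def step(self, state, modality, x):
        """Asynchronous update: called whenever `modality` produces a new sample."""
        feat = torch.relu(self.encoders[modality](x))
        z = torch.sigmoid(self.update_gate(torch.cat([state, feat], dim=1)))
        h_tilde = torch.tanh(self.candidate(torch.cat([state, feat], dim=1)))
        return (1 - z) * state + z * h_tilde

    def query(self, state):
        """A prediction can be requested at any time, independent of sensor rates."""
        return self.decoder(state)

# Usage: events arrive more often than frames; the state absorbs both streams.
model = AsyncRecurrentFusion()
state = torch.zeros(1, 64, 32, 32)
for modality, x in [
    ("events", torch.randn(1, 5, 32, 32)),
    ("events", torch.randn(1, 5, 32, 32)),
    ("frames", torch.randn(1, 3, 32, 32)),
]:
    state = model.step(state, modality, x)
depth = model.query(state)  # query between or after sensor updates
```

The point of the pattern is that updates are driven by data arrival rather than a fixed clock, so irregular event streams and lower-rate frames can share one state without resampling either signal.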

Results

Task             | Dataset       | Metric | Value | Model
Object Detection | DSEC          | mAP    | 17.6  | RAMNet
Object Detection | PKU-DDD17-Car | mAP50  | 79.6  | RAMNet

Related Papers

$S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)