Self-Supervised Multi-Frame Monocular Scene Flow

Junhwa Hur, Stefan Roth

2021-05-05 · CVPR 2021 · Self-Supervised Learning · Scene Flow Estimation

Abstract

Estimating 3D scene flow from a sequence of monocular images has been gaining increased attention due to the simple, economical capture setup. Owing to the severe ill-posedness of the problem, the accuracy of current methods has been limited, especially that of efficient, real-time approaches. In this paper, we introduce a multi-frame monocular scene flow network based on self-supervised learning, improving the accuracy over previous networks while retaining real-time efficiency. Based on an advanced two-frame baseline with a split-decoder design, we propose (i) a multi-frame model using a triple frame input and convolutional LSTM connections, (ii) an occlusion-aware census loss for better accuracy, and (iii) a gradient detaching strategy to improve training stability. On the KITTI dataset, we observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
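To make contribution (ii) more concrete, here is a minimal sketch of what an occlusion-aware census loss can look like. It is not the authors' implementation. The function names, the 3x3 window, the replicate padding at the border, and the hard Hamming distance are all assumptions made for illustration. Practical training code usually runs on tensors and uses a soft, differentiable variant. The idea shown is only this: compare census signatures of the target image and the warped image, and average the mismatch over pixels marked as visible.

```python
def census_transform(img, radius=1):
    """Per pixel, return a tuple of bits: 1 where a neighbour is darker than the centre.

    `img` is a 2D list of intensities. Borders use replicate padding (an assumption).
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            bits = []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue
                    # Clamp neighbour coordinates to the image bounds.
                    ny = min(max(y + dy, 0), h - 1)
                    nx = min(max(x + dx, 0), w - 1)
                    bits.append(1 if img[ny][nx] < img[y][x] else 0)
            row.append(tuple(bits))
        out.append(row)
    return out


def occlusion_aware_census_loss(target, warped, visible, radius=1):
    """Mean Hamming distance between census signatures, over visible pixels only.

    `visible[y][x]` is truthy where the pixel is assumed non-occluded.
    Returns 0.0 when no pixel is visible.
    """
    ct = census_transform(target, radius)
    cw = census_transform(warped, radius)
    total, count = 0, 0
    for y in range(len(target)):
        for x in range(len(target[0])):
            if visible[y][x]:
                total += sum(a != b for a, b in zip(ct[y][x], cw[y][x]))
                count += 1
    return total / count if count else 0.0
```

Because the census signature only encodes intensity ordering within the window, adding a constant brightness offset to the warped image leaves the loss at zero, which is the usual motivation for preferring it over a plain photometric difference.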

Results

Task	Dataset	Metric	Value	Model
Scene Flow Estimation	KITTI 2015 Scene Flow Test	D1-all	30.78	Multi-Mono-SF
Scene Flow Estimation	KITTI 2015 Scene Flow Test	D2-all	34.41	Multi-Mono-SF
Scene Flow Estimation	KITTI 2015 Scene Flow Test	Fl-all	19.54	Multi-Mono-SF
Scene Flow Estimation	KITTI 2015 Scene Flow Test	Runtime (s)	0.063	Multi-Mono-SF
Scene Flow Estimation	KITTI 2015 Scene Flow Test	SF-all	44.04	Multi-Mono-SF
Scene Flow Estimation	KITTI 2015 Scene Flow Training	Runtime (s)	0.063	Multi-Mono-SF
Scene Flow Estimation	KITTI 2015 Scene Flow Training	D1-all	27.33	Multi-Mono-SF
Scene Flow Estimation	KITTI 2015 Scene Flow Training	D2-all	30.44	Multi-Mono-SF
Scene Flow Estimation	KITTI 2015 Scene Flow Training	Fl-all	18.92	Multi-Mono-SF
Scene Flow Estimation	KITTI 2015 Scene Flow Training	SF-all	39.82	Multi-Mono-SF
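
The D1-all, D2-all, Fl-all and SF-all values above are outlier percentages. Under the KITTI 2015 convention, an estimate counts as an outlier when its error exceeds both 3 pixels and 5% of the ground-truth magnitude, and a pixel is a scene flow (SF) outlier when any of its three estimates is an outlier. The sketch below illustrates that rule on flat per-pixel lists. The function names and the input layout are hypothetical, and it omits the official development kit's handling of valid-pixel masks and foreground/background splits.

```python
def is_outlier(err, gt_mag, abs_thresh=3.0, rel_thresh=0.05):
    """KITTI-style test: error above 3 px AND above 5% of the ground-truth magnitude."""
    return err > abs_thresh and err > rel_thresh * gt_mag


def outlier_rates(d1_err, d1_gt, d2_err, d2_gt, fl_err, fl_gt):
    """Return outlier percentages for D1, D2, Fl and the combined SF metric.

    Each argument is a list with one entry per pixel that has valid ground truth:
    *_err is the absolute (or end-point) error, *_gt the ground-truth magnitude.
    """
    n = len(d1_err)
    d1 = [is_outlier(e, g) for e, g in zip(d1_err, d1_gt)]
    d2 = [is_outlier(e, g) for e, g in zip(d2_err, d2_gt)]
    fl = [is_outlier(e, g) for e, g in zip(fl_err, fl_gt)]
    # A pixel is a scene flow outlier if any of the three estimates fails.
    sf = [a or b or c for a, b, c in zip(d1, d2, fl)]

    def pct(flags):
        return 100.0 * sum(flags) / n

    return {"D1": pct(d1), "D2": pct(d2), "Fl": pct(fl), "SF": pct(sf)}
```

Since SF combines the three tests with a logical OR, SF-all can never be lower than the largest of D1-all, D2-all and Fl-all, which is consistent with the table (44.04 on the test split against a maximum of 34.41).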
