SwinMTL: A Shared Architecture for Simultaneous Depth Estimation and Semantic Segmentation from Monocular Camera Images

Pardis Taghavi, Reza Langari, Gaurav Pandey

2024-03-15

Real-Time Semantic Segmentation, Segmentation, Semantic Segmentation, Multi-Task Learning, Depth Estimation, Monocular Depth Estimation

Abstract

This paper presents a multi-task learning framework that performs depth estimation and semantic segmentation concurrently from a single camera. The approach is built on a shared encoder-decoder architecture that integrates several techniques to improve the accuracy of both tasks without compromising computational efficiency. The paper also incorporates an adversarial training component, a Wasserstein GAN framework with a critic network, to refine the model's predictions. The framework is evaluated on two datasets, the outdoor Cityscapes dataset and the indoor NYU Depth V2 dataset, and it outperforms existing state-of-the-art methods in both segmentation and depth estimation. Ablation studies analyze the contributions of individual components, including pre-training strategies, the inclusion of critics, logarithmic depth scaling, and advanced image augmentations. The accompanying source code is available at https://github.com/PardisTaghavi/SwinMTL.
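As a rough illustration of the adversarial component mentioned in the abstract, the standard Wasserstein GAN losses can be sketched as below. This is only the textbook WGAN formulation written with NumPy; the function names are hypothetical, and the abstract does not say how the critic is fed, how its Lipschitz constraint is enforced, or how the adversarial term is weighted against the task losses, so none of that is shown.

```python
import numpy as np

def critic_loss(scores_real, scores_fake):
    """Textbook Wasserstein critic loss.

    scores_real: critic outputs for reference (ground-truth) maps.
    scores_fake: critic outputs for the model's predicted maps.
    Minimizing this pushes real scores up and fake scores down.
    """
    return float(np.mean(scores_fake) - np.mean(scores_real))

def generator_adv_loss(scores_fake):
    """Textbook adversarial term for the predictor (the 'generator').

    Minimizing this raises the critic's score on predicted maps.
    """
    return float(-np.mean(scores_fake))
```

In a full training loop these scalars would come from a differentiable critic network; the NumPy version above only shows how the two objectives are formed from critic scores.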

Results

Task	Dataset	Metric	Value	Model
Depth Estimation	Cityscapes test	RMSE	6.352	SwinMTL
Depth Estimation	Cityscapes	Absolute relative error (AbsRel)	0.089	SwinMTL
Depth Estimation	Cityscapes	RMSE	5.481	SwinMTL
Depth Estimation	Cityscapes	RMSE log	0.139	SwinMTL
Depth Estimation	Cityscapes	Square relative error (SqRel)	1.051	SwinMTL
Transfer Learning	NYUv2	Mean IoU	58.14	SwinMTL
Transfer Learning	Cityscapes test	RMSE	0.51	SwinMTL
Transfer Learning	Cityscapes test	mIoU	76.41	SwinMTL
Semantic Segmentation	Cityscapes val	mIoU	76.41	SwinMTL
Multi-Task Learning	NYUv2	Mean IoU	58.14	SwinMTL
Multi-Task Learning	Cityscapes test	RMSE	0.51	SwinMTL
Multi-Task Learning	Cityscapes test	mIoU	76.41	SwinMTL
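For readers unfamiliar with the metrics in the table, the sketch below computes them using the definitions commonly used in the monocular depth and segmentation literature. It is a minimal NumPy illustration, not the authors' evaluation code: the function names, the `gt > 0` validity mask, and the `ignore_index=255` convention are assumptions, and details such as depth caps or crops used in the paper are not covered.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Common depth error metrics over pixels with valid ground truth.

    Assumes predicted depths are strictly positive and that gt == 0
    marks missing ground truth (an assumption, not from the source).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    diff = pred - gt
    return {
        "abs_rel": float(np.mean(np.abs(diff) / gt)),   # AbsRel
        "sq_rel": float(np.mean(diff ** 2 / gt)),       # SqRel
        "rmse": float(np.sqrt(np.mean(diff ** 2))),     # RMSE
        "rmse_log": float(
            np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
        ),                                              # RMSE log
    }

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean intersection-over-union across classes that occur.

    Pixels whose label equals ignore_index are skipped; classes absent
    from both prediction and ground truth are left out of the mean.
    """
    pred = np.asarray(pred, dtype=np.int64).ravel()
    gt = np.asarray(gt, dtype=np.int64).ravel()
    keep = gt != ignore_index
    pred, gt = pred[keep], gt[keep]
    # Confusion matrix: rows are ground-truth classes, columns predictions.
    conf = np.bincount(
        gt * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0
    return float(np.mean(inter[present] / union[present]))
```

Note that `mean_iou` returns a fraction in [0, 1], whereas the table reports mIoU as a percentage.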
