Pardis Taghavi, Reza Langari, Gaurav Pandey
This paper presents a multi-task learning framework that performs depth estimation and semantic segmentation concurrently from a single camera. The approach is built on a shared encoder-decoder architecture and integrates several techniques that improve the accuracy of both tasks without compromising computational efficiency. It also incorporates an adversarial training component, a Wasserstein GAN framework with a critic network, to refine the model's predictions. The framework is evaluated on two datasets, the outdoor Cityscapes dataset and the indoor NYU Depth V2 dataset, and outperforms existing state-of-the-art methods in both segmentation and depth estimation. Ablation studies analyze the contributions of individual components, including pre-training strategies, the inclusion of critics, logarithmic depth scaling, and advanced image augmentations. The accompanying source code is available at https://github.com/PardisTaghavi/SwinMTL.
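
To make the adversarial component more concrete, the following is a minimal sketch of the standard Wasserstein GAN objectives, assuming the critic returns one scalar score per sample. The function names `critic_loss` and `generator_adversarial_loss` are hypothetical and are not taken from the SwinMTL code. The Lipschitz constraint on the critic (weight clipping or a gradient penalty) is omitted, since the abstract does not say which one is used.

```python
def _mean(values):
    """Arithmetic mean of a non-empty sequence of floats."""
    if not values:
        raise ValueError("expected at least one score")
    return sum(values) / len(values)


def critic_loss(real_scores, fake_scores):
    """Standard WGAN critic loss: E[D(fake)] - E[D(real)].

    Minimizing this pushes critic scores up on real samples
    (e.g. ground-truth maps) and down on predicted ones.
    """
    return _mean(fake_scores) - _mean(real_scores)


def generator_adversarial_loss(fake_scores):
    """Standard WGAN generator term: -E[D(fake)].

    Assumption: in a multi-task setup this term would be added,
    with some weight, to the supervised depth and segmentation losses.
    """
    return -_mean(fake_scores)
```

With critic scores of `[1.0, 3.0]` on real samples and `[0.0, 1.0]` on predictions, the critic loss is `0.5 - 2.0 = -1.5` and the generator term is `-0.5`.
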
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Depth Estimation | Cityscapes test | RMSE | 6.352 | SwinMTL |
| Depth Estimation | Cityscapes | Absolute relative error (AbsRel) | 0.089 | SwinMTL |
| Depth Estimation | Cityscapes | RMSE | 5.481 | SwinMTL |
| Depth Estimation | Cityscapes | RMSE log | 0.139 | SwinMTL |
| Depth Estimation | Cityscapes | Square relative error (SqRel) | 1.051 | SwinMTL |
| Transfer Learning | NYUv2 | Mean IoU | 58.14 | SwinMTL |
| Transfer Learning | Cityscapes test | RMSE | 0.51 | SwinMTL |
| Transfer Learning | Cityscapes test | mIoU | 76.41 | SwinMTL |
| Semantic Segmentation | Cityscapes val | mIoU | 76.41 | SwinMTL |
| Multi-Task Learning | NYUv2 | Mean IoU | 58.14 | SwinMTL |
| Multi-Task Learning | Cityscapes test | RMSE | 0.51 | SwinMTL |
| Multi-Task Learning | Cityscapes test | mIoU | 76.41 | SwinMTL |
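
For readers unfamiliar with the metrics in the table, the sketch below computes them using their commonly used definitions. It is an illustration only: the function names `depth_metrics` and `mean_iou` are hypothetical, and evaluation details such as depth caps, valid-pixel masking, or cropping are not stated in this listing and are not modeled here. Inputs are flat lists of per-pixel values, and all depths are assumed to be strictly positive.

```python
import math


def depth_metrics(pred, target):
    """Return AbsRel, SqRel, RMSE and RMSE log for paired positive depths."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    n = len(pred)
    pairs = list(zip(pred, target))
    abs_rel = sum(abs(p - t) / t for p, t in pairs) / n
    sq_rel = sum((p - t) ** 2 / t for p, t in pairs) / n
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(t)) ** 2 for p, t in pairs) / n
    )
    return {"abs_rel": abs_rel, "sq_rel": sq_rel,
            "rmse": rmse, "rmse_log": rmse_log}


def mean_iou(pred_labels, true_labels, num_classes):
    """Mean intersection-over-union across classes that occur at least once."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred_labels, true_labels)
                    if p == c and t == c)
        union = sum(1 for p, t in zip(pred_labels, true_labels)
                    if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")
```

For example, `depth_metrics([2.0, 4.0], [1.0, 4.0])` gives an AbsRel and SqRel of 0.5 each. `mean_iou` returns a fraction in [0, 1], whereas the mIoU values in the table appear to be reported as percentages.
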