Harsh Maheshwari, Yen-Cheng Liu, Zsolt Kira
Using multiple spatial modalities has been proven helpful in improving semantic segmentation performance. However, there are several real-world challenges that have yet to be addressed: (a) improving label efficiency and (b) enhancing robustness in realistic scenarios where modalities are missing at test time. To address these challenges, we first propose a simple yet efficient multi-modal fusion mechanism, Linear Fusion, which performs better than state-of-the-art multi-modal models even with limited supervision. Second, we propose M3L: Multi-modal Teacher for Masked Modality Learning, a semi-supervised framework that not only improves multi-modal performance but also uses unlabeled data to make the model robust to the realistic missing-modality scenario. We create the first benchmark for semi-supervised multi-modal semantic segmentation and also report robustness to missing modalities. Our proposal shows an absolute improvement of up to 10% in robust mIoU over the most competitive baselines. Our code is available at https://github.com/harshm121/M3L
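The abstract names a teacher-based, masked-modality framework but does not spell out its mechanics. The sketch below illustrates only the two generic ingredients that such a setup usually combines: an exponential-moving-average (EMA) teacher update, as in standard Mean Teacher training, and randomly dropping one input modality. It is an assumption-laden illustration, not the authors' implementation: the function names `ema_update` and `mask_one_modality`, the decay value, the 50/50 masking choice, and the use of flat Python lists in place of tensors are all hypothetical.

```python
import random

def ema_update(teacher, student, decay=0.99):
    """Generic Mean Teacher step: move teacher weights toward the student.

    `teacher` and `student` are flat lists of floats standing in for model
    parameters (an assumption made to keep the sketch dependency-free).
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def mask_one_modality(rgb, depth, rng):
    """Zero out exactly one modality, chosen with equal probability.

    Hypothetical masking scheme; the paper's actual masking strategy is
    not described in the abstract.
    """
    if rng.random() < 0.5:
        return [0.0] * len(rgb), depth   # RGB missing, depth kept
    return rgb, [0.0] * len(depth)       # depth missing, RGB kept

# Usage: one EMA step and one masked sample.
rng = random.Random(0)
teacher = ema_update([1.0, 2.0], [0.0, 0.0], decay=0.5)
masked_rgb, masked_depth = mask_one_modality([0.2, 0.4], [1.5, 2.5], rng)
```

In a typical teacher-student setup, predictions made by the EMA teacher on unlabeled data serve as targets for the student; whether M3L feeds the teacher all modalities and the student the masked input is an assumption to check against the released code.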
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | SUN-RGBD | Mean IoU (test) | 48.17 | DFormer-L |
| Semantic Segmentation | Stanford2D3D - RGBD | mIoU | 57.16 | Linear Fusion (Segformer B2) |
| Semantic Segmentation | 2D-3D-S | mIoU (0.1% labels) | 40.05 | M3L (Linear Fusion B2) |
| Semantic Segmentation | 2D-3D-S | mIoU (0.2% labels) | 44.62 | M3L (Linear Fusion B2) |
| Semantic Segmentation | 2D-3D-S | mIoU (1% labels) | 49.28 | M3L (Linear Fusion B2) |
| Semantic Segmentation | Stanford 2D-3D | MM-Robust mIoU (0.1% labels) | 41.36 | M3L (Linear Fusion - Segformer B2) |
| Semantic Segmentation | Stanford 2D-3D | mIoU (0.1% labels) | 44.1 | M3L (Linear Fusion - Segformer B2) |
| Semantic Segmentation | Stanford 2D-3D | mIoU (0.1% labels) | 41.7 | Mean Teacher (Linear Fusion - Segformer B2) |
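
Every row in the table reports a variant of mean Intersection over Union (mIoU). As a reminder of what that number measures, here is a minimal sketch of the standard mIoU computation over flattened label lists. The function name `mean_iou`, the `ignore_index=255` convention, and the choice to average only over classes that occur are assumptions for illustration. The "MM-Robust mIoU" variant is not defined in this excerpt and is not reproduced here.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes present in the prediction or the ground truth.

    `pred` and `target` are equal-length sequences of integer class ids
    (for example, a flattened segmentation map). Pixels whose target is
    `ignore_index` are skipped.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatched pixel belongs to the union of both classes.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")

# Usage: two classes, one pixel of class 1 mispredicted as class 0.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Benchmarks usually accumulate intersections and unions over the whole test set before dividing, rather than averaging per-image scores; the same function applies if all images are concatenated into one pair of lists.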