Lukas Hoyer, Dengxin Dai, Luc Van Gool
Unsupervised domain adaptation (UDA) aims to adapt a model trained on the source domain (e.g. synthetic data) to the target domain (e.g. real-world data) without requiring further annotations on the target domain. This work focuses on UDA for semantic segmentation, as real-world pixel-wise annotations are particularly expensive to acquire. Since UDA methods for semantic segmentation are usually GPU-memory intensive, most previous methods operate only on downscaled images. We question this design, as low-resolution predictions often fail to preserve fine details. The alternative of training with random crops of high-resolution images alleviates this problem but falls short in capturing long-range, domain-robust context information. Therefore, we propose HRDA, a multi-resolution training approach for UDA that combines the strengths of small high-resolution crops to preserve fine segmentation details and large low-resolution crops to capture long-range context dependencies with a learned scale attention, while maintaining a manageable GPU memory footprint. HRDA enables adapting small objects and preserving fine segmentation details. It significantly improves the state-of-the-art performance by 5.5 mIoU for GTA-to-Cityscapes and 4.9 mIoU for Synthia-to-Cityscapes, resulting in an unprecedented 73.8 and 65.8 mIoU, respectively. The implementation is available at https://github.com/lhoyer/HRDA.
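The core idea in the abstract, fusing a high-resolution detail prediction with a low-resolution context prediction through a learned scale attention, can be illustrated with a short PyTorch sketch. This is a simplified illustration under stated assumptions, not the official implementation: both branches see the full image here (HRDA trains the detail branch on random high-resolution crops to bound GPU memory), and `segmentor` and `attention` are arbitrary placeholder modules whose names are invented for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaleAttentionFusion(nn.Module):
    """Blend detail (high-res) and context (low-res) predictions with a
    learned per-pixel scale attention. Illustrative sketch only; module
    and argument names are assumptions, not HRDA's actual code."""

    def __init__(self, segmentor: nn.Module, attention: nn.Module):
        super().__init__()
        self.segmentor = segmentor  # shared segmentation net -> class logits
        self.attention = attention  # small head -> one attention logit/pixel

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        h, w = image.shape[-2:]

        # Context branch: downscaled input gives a cheap, wide field of view.
        lr_image = F.interpolate(image, scale_factor=0.5, mode="bilinear",
                                 align_corners=False)
        lr_logits = self.segmentor(lr_image)

        # Detail branch: full resolution preserves fine structures. (HRDA
        # uses a random high-resolution crop here to bound GPU memory; the
        # full image is used only to keep this sketch short.)
        hr_logits = self.segmentor(image)

        # Learned scale attention, predicted from the context view:
        # a -> 1 trusts the detail branch, a -> 0 trusts the context branch.
        a = torch.sigmoid(self.attention(lr_image))

        # Bring all maps to full resolution and blend per pixel.
        lr_up = F.interpolate(lr_logits, size=(h, w), mode="bilinear",
                              align_corners=False)
        hr_up = F.interpolate(hr_logits, size=(h, w), mode="bilinear",
                              align_corners=False)
        a_up = F.interpolate(a, size=(h, w), mode="bilinear",
                             align_corners=False)
        return a_up * hr_up + (1.0 - a_up) * lr_up


# Toy usage with placeholder networks (also assumptions, not HRDA's):
seg = nn.Conv2d(3, 19, kernel_size=3, padding=1)   # stand-in segmentor
attn = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # stand-in attention head
fused = ScaleAttentionFusion(seg, attn)(torch.randn(1, 3, 128, 256))
print(fused.shape)  # torch.Size([1, 19, 128, 256])
```

The design point the sketch captures is that the attention is predicted from the low-resolution context view, so the network can learn where fine detail matters (e.g. small objects, boundaries) and defer to the detail branch only there.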
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Unsupervised Domain Adaptation | GTAV-to-Cityscapes | mIoU | 73.8 | HRDA |
| Unsupervised Domain Adaptation | SYNTHIA-to-Cityscapes | mIoU (16 classes) | 65.8 | HRDA |
| Unsupervised Domain Adaptation | SYNTHIA-to-Cityscapes | mIoU (13 classes) | 72.4 | HRDA |
| Unsupervised Domain Adaptation | Cityscapes-to-ACDC | mIoU | 68.0 | HRDA |
| Unsupervised Domain Adaptation | Cityscapes-to-Dark Zurich | mIoU | 55.9 | HRDA |
| Domain Generalization | GTA-to-Avg(Cityscapes, BDD, Mapillary) | mIoU | 55.9 | HRDA |