Lukas Hoyer, Dengxin Dai, Haoran Wang, Luc van Gool
In unsupervised domain adaptation (UDA), a model trained on source data (e.g. synthetic) is adapted to target data (e.g. real-world) without access to target annotations. Most previous UDA methods struggle with classes that look similar in the target domain, as no ground truth is available to learn the slight appearance differences. To address this problem, we propose a Masked Image Consistency (MIC) module that enhances UDA by learning spatial context relations of the target domain as additional clues for robust visual recognition. MIC enforces consistency between predictions of masked target images, where random patches are withheld, and pseudo-labels that an exponential moving average (EMA) teacher generates from the complete image. To minimize the consistency loss, the network has to learn to infer the predictions of the masked regions from their context. Due to its simple and universal concept, MIC can be integrated into various UDA methods across different visual recognition tasks such as image classification, semantic segmentation, and object detection. MIC significantly improves the state-of-the-art performance across these recognition tasks for synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather UDA. For instance, MIC achieves an unprecedented UDA performance of 75.9 mIoU on GTA-to-Cityscapes and 92.8% accuracy on VisDA-2017, an improvement of +2.1 and +3.0 percentage points, respectively, over the previous state of the art. The implementation is available at https://github.com/lhoyer/MIC.
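
The three ingredients named in the abstract (random patch masking, an EMA teacher, and a consistency loss against pseudo-labels) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation, which lives in the linked repository. All function names, the single-channel list-of-lists image format, the zero fill value for withheld patches, and the plain cross-entropy loss are assumptions made for illustration only.

```python
import math
import random


def patch_mask(height, width, patch_size, mask_ratio, rng):
    """Build a height x width 0/1 grid; each patch is withheld (0) with probability mask_ratio."""
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    keep = [[0 if rng.random() < mask_ratio else 1 for _ in range(cols)]
            for _ in range(rows)]
    # Upsample the patch-level decisions to pixel resolution.
    return [[keep[y // patch_size][x // patch_size] for x in range(width)]
            for y in range(height)]


def apply_mask(image, mask):
    """Zero out the withheld pixels of a single-channel image (assumed fill value: 0)."""
    return [[pixel * m for pixel, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def ema_update(teacher_weights, student_weights, alpha):
    """Move each teacher weight toward the student weight (exponential moving average)."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_weights, student_weights)]


def pseudo_labels(teacher_probs):
    """Pick the most likely class per pixel from the teacher's prediction on the complete image."""
    return [max(range(len(p)), key=p.__getitem__) for p in teacher_probs]


def consistency_loss(student_probs, labels):
    """Mean cross-entropy between student predictions (masked image) and teacher pseudo-labels."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(student_probs, labels)) / len(labels)


# Hypothetical usage with toy values (no real network involved).
rng = random.Random(0)
image = [[1.0] * 8 for _ in range(8)]
masked_image = apply_mask(image, patch_mask(8, 8, patch_size=4, mask_ratio=0.5, rng=rng))
labels = pseudo_labels([[0.9, 0.1], [0.2, 0.8]])            # teacher sees the complete image
loss = consistency_loss([[0.6, 0.4], [0.3, 0.7]], labels)   # student sees the masked image
teacher = ema_update([1.0, 0.0], [0.0, 1.0], alpha=0.9)     # teacher slowly tracks the student
```

In a real training loop the probabilities would come from a student and a teacher network, and only the student would receive gradients from the loss. The sketch only shows how the pieces relate to each other.
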
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image-to-Image Translation | Cityscapes-to-Foggy Cityscapes | mAP | 47.6 | MIC |
| Image-to-Image Translation | SYNTHIA-to-Cityscapes | mIoU (13 classes) | 74 | MIC |
| Image-to-Image Translation | GTAV-to-Cityscapes Labels | mIoU | 75.9 | MIC |
| Image-to-Image Translation | GTAV-to-Cityscapes Labels | mIoU | 75.9 | HRDA+MIC |
| Image-to-Image Translation | SYNTHIA-to-Cityscapes | mIoU (16 classes) | 67.3 | MIC |
| Domain Adaptation | SYNTHIA-to-Cityscapes | mIoU | 67.3 | MIC |
| Domain Adaptation | GTA5 to Cityscapes | mIoU | 75.9 | MIC |
| Domain Adaptation | Cityscapes to ACDC | mIoU | 70.4 | MIC |
| Domain Adaptation | VisDA2017 | Accuracy (%) | 92.8 | MIC |
| Domain Adaptation | Office-Home | Accuracy (%) | 86.2 | MIC |
| Domain Adaptation | GTAV-to-Cityscapes Labels | mIoU | 75.9 | MIC |
| Domain Adaptation | Cityscapes to Foggy Cityscapes | mAP@0.5 | 47.6 | MIC |
| Domain Adaptation | SYNTHIA-to-Cityscapes | mIoU (13 classes) | 74 | MIC |
| Semantic Segmentation | Dark Zurich | mIoU | 60.2 | MIC |
| Semantic Segmentation | GTAV-to-Cityscapes Labels | mIoU | 75.9 | MIC |
| Semantic Segmentation | SYNTHIA-to-Cityscapes | mIoU | 67.3 | MIC |
| Unsupervised Domain Adaptation | GTAV-to-Cityscapes Labels | mIoU | 75.9 | MIC |
| Unsupervised Domain Adaptation | Cityscapes to Foggy Cityscapes | mAP@0.5 | 47.6 | MIC |
| Unsupervised Domain Adaptation | SYNTHIA-to-Cityscapes | mIoU | 67.3 | MIC |
| Unsupervised Domain Adaptation | SYNTHIA-to-Cityscapes | mIoU (13 classes) | 74 | MIC |