Runfa Chen, Yu Rong, Shangmin Guo, Jiaqi Han, Fuchun Sun, Tingyang Xu, Wenbing Huang
Following the great success of Vision Transformer variants (ViTs) in computer vision, they have also demonstrated great potential in domain adaptive semantic segmentation. Unfortunately, straightforwardly applying local ViTs to domain adaptive semantic segmentation does not bring the expected improvement. We find that the pitfall of local ViTs lies in the severe high-frequency components generated during both pseudo-label construction and feature alignment for the target domain. These high-frequency components make the training of local ViTs very unsmooth and hurt their transferability. In this paper, we introduce a low-pass filtering mechanism, the momentum network, to smooth the learning dynamics of target-domain features and pseudo labels. Furthermore, we propose a dynamic discrepancy measurement that aligns the source and target distributions via dynamic weights evaluating the importance of each sample. With these issues tackled, extensive experiments on sim2real benchmarks show that the proposed method outperforms state-of-the-art methods. Our code is available at https://github.com/alpc91/TransDA
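The momentum network described above acts as a low-pass filter in weight space: a copy of the segmentation network whose parameters track the student's by an exponential moving average, so that target-domain pseudo labels come from smoothed weights rather than the rapidly changing student. A minimal PyTorch sketch of this idea is below; the tiny `student` network, the decay value `alpha=0.999`, and the `momentum_update` helper are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import copy
import torch
import torch.nn as nn

# Hypothetical student segmentation network (stand-in for a local ViT backbone;
# 19 output channels mirrors the Cityscapes class count).
student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 19, 1))

# Momentum (teacher) network: a frozen weight-space copy updated only by EMA.
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def momentum_update(teacher, student, alpha=0.999):
    """EMA update: teacher <- alpha * teacher + (1 - alpha) * student.

    With alpha close to 1 this low-pass-filters the weight trajectory,
    smoothing the target-domain features and pseudo labels derived from
    the teacher.
    """
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(alpha).add_(ps, alpha=1 - alpha)

# Usage: pseudo labels for a target-domain image come from the smoothed teacher.
target_image = torch.randn(1, 3, 64, 64)
with torch.no_grad():
    pseudo_label = teacher(target_image).argmax(dim=1)  # per-pixel class ids
momentum_update(teacher, student)  # called after each student optimizer step
```

In this kind of scheme the teacher is never trained by backpropagation; its stability is exactly what damps the high-frequency components that the abstract identifies as the obstacle for local ViTs.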
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Unsupervised Domain Adaptation (Semantic Segmentation) | GTAV-to-Cityscapes | mIoU | 63.9 | TransDA-B |
| Unsupervised Domain Adaptation (Semantic Segmentation) | SYNTHIA-to-Cityscapes | mIoU (13 classes) | 66.3 | TransDA-B |
| Unsupervised Domain Adaptation (Semantic Segmentation) | SYNTHIA-to-Cityscapes | mIoU (16 classes) | 59.3 | TransDA-B |