Boyong He, Yuxiang Ji, Zhuoyue Tan, Liaoni Wu
Object detectors often suffer a decrease in performance due to the large domain gap between the training data (source domain) and real-world data (target domain). Diffusion-based generative models have shown remarkable abilities in generating high-quality and diverse images, suggesting their potential for extracting valuable features from various domains. To effectively leverage the cross-domain feature representation of diffusion models, in this paper, we train a detector with a frozen-weight diffusion model on the source domain, then employ it as a teacher model to generate pseudo labels on the unlabeled target domain, which are used to guide the supervised learning of the student model on the target domain. We refer to this approach as Diffusion Domain Teacher (DDT). By employing this straightforward yet potent framework, we significantly improve cross-domain object detection performance without compromising the inference speed. Our method achieves an average mAP improvement of 21.2% over the baseline on 6 datasets from three common cross-domain detection benchmarks (Cross-Camera, Syn2Real, Real2Artistic), surpassing the current state-of-the-art (SOTA) methods by an average of 5.7% mAP. Furthermore, extensive experiments demonstrate that our method consistently brings improvements even to more powerful and complex models, highlighting the broadly applicable and effective domain adaptation capability of our DDT. The code is available at https://github.com/heboyong/Diffusion-Domain-Teacher.
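The teacher-to-student transfer described above can be sketched as a confidence-based pseudo-labeling step: the source-trained teacher runs on unlabeled target images, and only its high-confidence detections are kept as supervision for the student. The threshold value and field names below are illustrative assumptions, not the paper's exact settings.

```python
def filter_pseudo_labels(teacher_preds, confidence_threshold=0.8):
    """Keep only high-confidence teacher detections as pseudo labels.

    `teacher_preds` is a list of detections, each a dict with a "box",
    a class "label", and a confidence "score" (assumed format).
    """
    return [p for p in teacher_preds if p["score"] >= confidence_threshold]


# Toy teacher predictions on one unlabeled target-domain image.
teacher_preds = [
    {"box": (10, 10, 50, 50), "label": "car", "score": 0.95},
    {"box": (5, 5, 20, 20), "label": "person", "score": 0.40},
]

# Only the confident "car" detection survives and would be used as a
# pseudo ground-truth box when training the student on this image.
pseudo_labels = filter_pseudo_labels(teacher_preds)
```

In a full training loop, these pseudo labels would replace ground-truth annotations in the student's standard detection loss on target-domain images, while the teacher's diffusion backbone stays frozen.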
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Unsupervised Domain Adaptation | Cityscapes to Foggy Cityscapes | mAP@0.5 | 50.0 | DDT |
| Unsupervised Domain Adaptation | BDD100k to Cityscapes | mAP | 43.4 | DDT (R-101) |
| Unsupervised Domain Adaptation | SIM10k to Cityscapes | mAP@0.5 | 64.0 | DDT |
| Unsupervised Domain Adaptation | SIM10k to BDD100k | mAP@0.5 | 58.3 | DDT |
| Unsupervised Domain Adaptation | Pascal VOC to Clipart1k | mAP | 55.6 | DDT |
| Unsupervised Domain Adaptation | Pascal VOC to Watercolor2k | mAP | 63.7 | DDT |
| Unsupervised Domain Adaptation | Pascal VOC to Comic2k | mAP | 50.2 | DDT |