Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, Heung-Yeung Shum
We present DINO (\textbf{D}ETR with \textbf{I}mproved de\textbf{N}oising anch\textbf{O}r boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive approach to denoising training, a mixed query selection method for anchor initialization, and a look-forward-twice scheme for box prediction. DINO achieves $49.4$AP in $12$ epochs and $51.3$AP in $24$ epochs on COCO with a ResNet-50 backbone and multi-scale features, a significant improvement of $\textbf{+6.0}$\textbf{AP} and $\textbf{+2.7}$\textbf{AP}, respectively, over DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO \texttt{val2017} ($\textbf{63.2}$\textbf{AP}) and \texttt{test-dev} ($\textbf{63.3}$\textbf{AP}). Compared to other models on the leaderboard, DINO uses a significantly smaller model and less pre-training data while achieving better results. Our code will be available at \url{https://github.com/IDEACVR/DINO}.
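The contrastive denoising idea named in the abstract can be illustrated with a small sketch. It assumes the usual formulation in which each ground-truth box yields one lightly perturbed "positive" query, trained to reconstruct the box, and one more heavily perturbed "negative" query, trained to predict "no object". The function names, the thresholds `lambda1` and `lambda2`, their default values, and the exact jitter rule are hypothetical choices for illustration, not the authors' implementation.

```python
import random


def jitter_box(box, low, high, rng):
    """Perturb a (cx, cy, w, h) box with noise scales drawn from [low, high]."""
    cx, cy, w, h = box
    noisy = []
    # Centres shift by up to scale * half the box extent; sizes are rescaled.
    for value, extent, is_center in (
        (cx, w, True), (cy, h, True), (w, w, False), (h, h, False)
    ):
        scale = rng.uniform(low, high)
        sign = rng.choice((-1.0, 1.0))
        if is_center:
            noisy.append(value + sign * scale * extent / 2.0)
        else:
            noisy.append(value * (1.0 + sign * scale))
    return tuple(noisy)


def make_cdn_queries(gt_boxes, gt_labels, lambda1=0.4, lambda2=0.8,
                     no_object=-1, seed=0):
    """Build one positive and one negative denoising query per ground-truth box.

    Assumed (hypothetical) convention: noise below lambda1 marks a positive
    query, noise between lambda1 and lambda2 marks a negative query.
    """
    rng = random.Random(seed)
    queries = []
    for box, label in zip(gt_boxes, gt_labels):
        positive = jitter_box(box, 0.0, lambda1, rng)
        negative = jitter_box(box, lambda1, lambda2, rng)
        # Positive query: should reconstruct the original box and label.
        queries.append({"box": positive, "target_box": box, "target_label": label})
        # Negative query: should be rejected as background.
        queries.append({"box": negative, "target_box": None, "target_label": no_object})
    return queries


if __name__ == "__main__":
    for query in make_cdn_queries([(0.5, 0.5, 0.2, 0.4)], [3]):
        print(query)
```

In this sketch the negative queries sit close to a real object but not close enough to count as a match, which is the kind of hard example that is meant to discourage duplicate predictions around the same box.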
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | COCO test-dev | box mAP | 63.3 | DINO (Swin-L, multi-scale, TTA) |
| Object Detection | COCO-O | Average mAP | 42.1 | DINO (Swin-L) |
| Object Detection | COCO-O | Effective Robustness | 15.76 | DINO (Swin-L) |
| Object Detection | SA-Det-100k | AP | 43.7 | DINO (ResNet50 1x VFL) |
| Object Detection | SA-Det-100k | AP50 | 52 | DINO (ResNet50 1x VFL) |
| Object Detection | SA-Det-100k | AP75 | 47.7 | DINO (ResNet50 1x VFL) |
| Object Detection | SA-Det-100k | APL | 61.5 | DINO (ResNet50 1x VFL) |
| Object Detection | SA-Det-100k | APM | 43 | DINO (ResNet50 1x VFL) |
| Object Detection | SA-Det-100k | APS | 5.8 | DINO (ResNet50 1x VFL) |
| Object Detection | COCO minival | box AP | 63.2 | DINO (Swin-L) |
| Object Detection | COCO minival | AP50 | 69.1 | DINO-5scale (24 epoch) |
| Object Detection | COCO minival | AP75 | 56.0 | DINO-5scale (24 epoch) |
| Object Detection | COCO minival | APL | 65.8 | DINO-5scale (24 epoch) |
| Object Detection | COCO minival | APM | 54.2 | DINO-5scale (24 epoch) |
| Object Detection | COCO minival | APS | 34.5 | DINO-5scale (24 epoch) |
| Object Detection | COCO minival | box AP | 51.3 | DINO-5scale (24 epoch) |
| Object Detection | COCO minival | AP50 | 69.0 | DINO-5scale (36 epoch) |
| Object Detection | COCO minival | AP75 | 55.8 | DINO-5scale (36 epoch) |
| Object Detection | COCO minival | APL | 65.3 | DINO-5scale (36 epoch) |
| Object Detection | COCO minival | APM | 54.3 | DINO-5scale (36 epoch) |
| Object Detection | COCO minival | APS | 35.0 | DINO-5scale (36 epoch) |
| Object Detection | COCO minival | box AP | 51.2 | DINO-5scale (36 epoch) |