Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still at an early stage. This work presents a new large-scale CNN-based foundation model, termed InternImage, which, like ViTs, gains from increased parameters and training data. Unlike recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as its core operator, so the model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also performs adaptive spatial aggregation conditioned on the input and task information. As a result, InternImage relaxes the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns from massive data with large-scale parameters, as ViTs do. The effectiveness of the model is demonstrated on challenging benchmarks including ImageNet, COCO, and ADE20K. Notably, InternImage-H achieved a new record of 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.
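To make "adaptive spatial aggregation" concrete, the following is a minimal sketch of a single output value of a 3x3 modulated deformable convolution on a one-channel image. It is not InternImage's actual operator: the function names `bilinear_sample` and `deformable_conv_at` are hypothetical, and the offsets and modulation scalars are passed in by hand here, whereas in a real network they would be predicted from the input features.

```python
import math

def bilinear_sample(image, y, x):
    """Sample a 2D list `image` at fractional (y, x); out-of-bounds reads count as 0."""
    h, w = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                # Bilinear weight falls off linearly with distance to the pixel.
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * image[yy][xx]
    return value

def deformable_conv_at(image, weights, offsets, masks, cy, cx):
    """One output value of a 3x3 deformable convolution centred at (cy, cx).

    weights: 9 kernel weights (fixed, as in a regular convolution)
    offsets: 9 (dy, dx) pairs shifting each sampling point (assumed given here)
    masks:   9 modulation scalars scaling each sampled value (assumed given here)
    """
    grid = [(gy, gx) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]
    out = 0.0
    for k, (gy, gx) in enumerate(grid):
        dy, dx = offsets[k]
        sample = bilinear_sample(image, cy + gy + dy, cx + gx + dx)
        out += weights[k] * masks[k] * sample
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ones = [1.0] * 9

# Zero offsets and unit masks reduce to an ordinary 3x3 convolution.
regular = deformable_conv_at(image, ones, [(0, 0)] * 9, ones, 1, 1)
# Shifting every sampling point half a pixel to the right changes what is aggregated.
shifted = deformable_conv_at(image, ones, [(0, 0.5)] * 9, ones, 1, 1)
print(regular, shifted)
```

With zero offsets the operator behaves like a fixed-grid convolution; nonzero offsets move the sampling points, which is how the receptive field can adapt to the input while the kernel itself stays small.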
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | Replica | mIoU | 38.4 | InternImage |
| Semantic Segmentation | Cityscapes val | mIoU | 87 | InternImage-H |
| Semantic Segmentation | Cityscapes val | mIoU | 86.4 | InternImage-XL |
| Semantic Segmentation | PASCAL Context | mIoU | 70.3 | InternImage-H |
| Semantic Segmentation | ADE20K | GFLOPs | 4635 | InternImage-H |
| Semantic Segmentation | ADE20K | Params (M) | 1310 | InternImage-H |
| Semantic Segmentation | ADE20K | Validation mIoU | 62.9 | InternImage-H |
| Semantic Segmentation | ADE20K | GFLOPs | 3142 | InternImage-XL |
| Semantic Segmentation | ADE20K | Params (M) | 368 | InternImage-XL |
| Semantic Segmentation | ADE20K | Validation mIoU | 55.3 | InternImage-XL |
| Semantic Segmentation | ADE20K | GFLOPs | 2526 | InternImage-L |
| Semantic Segmentation | ADE20K | Params (M) | 256 | InternImage-L |
| Semantic Segmentation | ADE20K | Validation mIoU | 54.1 | InternImage-L |
| Semantic Segmentation | ADE20K | GFLOPs | 1185 | InternImage-B |
| Semantic Segmentation | ADE20K | Params (M) | 128 | InternImage-B |
| Semantic Segmentation | ADE20K | Validation mIoU | 51.3 | InternImage-B |
| Semantic Segmentation | ADE20K | GFLOPs | 1017 | InternImage-S |
| Semantic Segmentation | ADE20K | Params (M) | 80 | InternImage-S |
| Semantic Segmentation | ADE20K | Validation mIoU | 50.9 | InternImage-S |
| Semantic Segmentation | ADE20K | GFLOPs | 944 | InternImage-T |
| Semantic Segmentation | ADE20K | Params (M) | 59 | InternImage-T |
| Semantic Segmentation | ADE20K | Validation mIoU | 48.1 | InternImage-T |
| Semantic Segmentation | ADE20K | Params (M) | 1310 | InternImage-H (M3I Pre-training) |
| Object Detection | CrowdHuman (full body) | AP | 97.2 | InternImage-H |
| Object Detection | LVIS v1.0 minival | box AP | 65.8 | InternImage-H |
| Object Detection | COCO test-dev | Params (M) | 2180 | InternImage-H (M3I Pre-training) |
| Object Detection | COCO test-dev | box mAP | 65.5 | InternImage-H (M3I Pre-training) |
| Object Detection | COCO test-dev | Params (M) | 602 | InternImage-XL |
| Object Detection | COCO test-dev | box mAP | 64.3 | InternImage-XL |
| Object Detection | COCO-O | Average mAP | 37 | InternImage-L (Cascade Mask R-CNN) |
| Object Detection | COCO-O | Effective Robustness | 11.72 | InternImage-L (Cascade Mask R-CNN) |
| Object Detection | OpenImages-v6 | box AP | 74.1 | InternImage-H |
| Object Detection | PASCAL VOC 2012 | mAP | 97.2 | InternImage-H |
| Object Detection | COCO minival | box AP | 65 | InternImage-H |
| Object Detection | COCO minival | box AP | 64.2 | InternImage-XL |
| Object Detection | LVIS v1.0 val | box AP | 63.2 | InternImage-H |
| Image Classification | ImageNet | GFLOPs | 1478 | InternImage-H |
| Image Classification | ImageNet | GFLOPs | 163 | InternImage-XL |
| Image Classification | ImageNet | GFLOPs | 108 | InternImage-L |
| Image Classification | ImageNet | GFLOPs | 16 | InternImage-B |
| Image Classification | ImageNet | GFLOPs | 8 | InternImage-S |
| Instance Segmentation | COCO minival | AP50 | 80.1 | InternImage-H |
| Instance Segmentation | COCO minival | AP75 | 61.5 | InternImage-H |
| Instance Segmentation | COCO minival | APL | 74.4 | InternImage-H |
| Instance Segmentation | COCO minival | APM | 58.4 | InternImage-H |
| Instance Segmentation | COCO minival | APS | 37.9 | InternImage-H |
| Instance Segmentation | COCO minival | mask AP | 55.4 | InternImage-H |
| Instance Segmentation | COCO minival | GFLOPs | 1782 | InternImage-XL |
| Instance Segmentation | COCO minival | Params (M) | 387 | InternImage-XL |
| Instance Segmentation | COCO minival | mask AP | 48.8 | InternImage-XL |
| Instance Segmentation | COCO minival | GFLOPs | 1399 | InternImage-L |
| Instance Segmentation | COCO minival | Params (M) | 277 | InternImage-L |
| Instance Segmentation | COCO minival | box AP | 56.1 | InternImage-L |
| Instance Segmentation | COCO minival | mask AP | 48.5 | InternImage-L |
| Instance Segmentation | COCO minival | GFLOPs | 340 | InternImage-S |
| Instance Segmentation | COCO minival | Params (M) | 69 | InternImage-S |
| Instance Segmentation | COCO minival | box AP | 49.7 | InternImage-S |
| Instance Segmentation | COCO minival | mask AP | 44.5 | InternImage-S |
| Instance Segmentation | COCO minival | GFLOPs | 270 | InternImage-T |
| Instance Segmentation | COCO minival | Params (M) | 49 | InternImage-T |
| Instance Segmentation | COCO minival | box AP | 49.1 | InternImage-T |
| Instance Segmentation | COCO minival | mask AP | 43.7 | InternImage-T |
| Instance Segmentation | COCO minival | GFLOPs | 501 | InternImage-B |
| Instance Segmentation | COCO minival | Params (M) | 115 | InternImage-B |
| Instance Segmentation | COCO test-dev | AP50 | 80.8 | InternImage-H |
| Instance Segmentation | COCO test-dev | AP75 | 62.2 | InternImage-H |
| Instance Segmentation | COCO test-dev | APL | 70.3 | InternImage-H |
| Instance Segmentation | COCO test-dev | APM | 58.9 | InternImage-H |
| Instance Segmentation | COCO test-dev | APS | 41 | InternImage-H |
| 2D Object Detection | BDD100K val | mAP | 38.8 | InternImage-H |