InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao

2022-11-10 | CVPR 2023
Tasks: Image Classification, Semantic Segmentation, Instance Segmentation, 2D Object Detection, Classification, Object Detection


Abstract

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.
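The abstract names deformable convolution as the core operator, with spatial aggregation that adapts to the input. The following is a minimal sketch of that general idea only, assuming a single channel, a 3x3 kernel, and offsets and modulation scalars supplied by the caller (in a real network these would be predicted from the input features). It is not the operator used in InternImage; the function names `bilinear` and `deformable_sample` are hypothetical and chosen for illustration.

```python
import math


def bilinear(img, y, x):
    """Sample a 2D list `img` at fractional (y, x); points outside count as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    total = 0.0
    for yy, xx in ((y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)):
        if 0 <= yy < h and 0 <= xx < w:
            # Each neighbour's weight falls off linearly with its distance.
            total += img[yy][xx] * (1 - abs(y - yy)) * (1 - abs(x - xx))
    return total


def deformable_sample(img, cy, cx, weights, offsets, masks):
    """Output at (cy, cx) for a 3x3 kernel with per-tap offsets and modulation.

    weights: 9 kernel weights (fixed after training).
    offsets: 9 (dy, dx) pairs shifting each sampling point (assumed inputs here).
    masks:   9 modulation scalars scaling each tap (assumed inputs here).
    """
    out = 0.0
    k = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            oy, ox = offsets[k]
            out += weights[k] * masks[k] * bilinear(img, cy + dy + oy, cx + dx + ox)
            k += 1
    return out


# A 4x4 ramp image: img[r][c] = 4*r + c.
img = [[4 * r + c for c in range(4)] for r in range(4)]
uniform = [1.0 / 9] * 9
ones = [1.0] * 9

# Zero offsets reduce to an ordinary 3x3 convolution (here, a box filter).
regular = deformable_sample(img, 1, 1, uniform, [(0.0, 0.0)] * 9, ones)

# Non-zero offsets move the sampling grid off the regular lattice.
shifted = deformable_sample(img, 1, 1, uniform, [(0.5, 0.5)] * 9, ones)
```

With all offsets at zero the sketch behaves like a regular convolution, which is why the sampling offsets are what let the receptive field follow the content rather than a fixed grid.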

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	Replica	mIoU	38.4	InternImage
Semantic Segmentation	Cityscapes val	mIoU	87	InternImage-H
Semantic Segmentation	Cityscapes val	mIoU	86.4	InternImage-XL
Semantic Segmentation	PASCAL Context	mIoU	70.3	InternImage-H
Semantic Segmentation	ADE20K	GFLOPs	4635	InternImage-H
Semantic Segmentation	ADE20K	Params (M)	1310	InternImage-H
Semantic Segmentation	ADE20K	Validation mIoU	62.9	InternImage-H
Semantic Segmentation	ADE20K	GFLOPs	3142	InternImage-XL
Semantic Segmentation	ADE20K	Params (M)	368	InternImage-XL
Semantic Segmentation	ADE20K	Validation mIoU	55.3	InternImage-XL
Semantic Segmentation	ADE20K	GFLOPs	2526	InternImage-L
Semantic Segmentation	ADE20K	Params (M)	256	InternImage-L
Semantic Segmentation	ADE20K	Validation mIoU	54.1	InternImage-L
Semantic Segmentation	ADE20K	GFLOPs	1185	InternImage-B
Semantic Segmentation	ADE20K	Params (M)	128	InternImage-B
Semantic Segmentation	ADE20K	Validation mIoU	51.3	InternImage-B
Semantic Segmentation	ADE20K	GFLOPs	1017	InternImage-S
Semantic Segmentation	ADE20K	Params (M)	80	InternImage-S
Semantic Segmentation	ADE20K	Validation mIoU	50.9	InternImage-S
Semantic Segmentation	ADE20K	GFLOPs	944	InternImage-T
Semantic Segmentation	ADE20K	Params (M)	59	InternImage-T
Semantic Segmentation	ADE20K	Validation mIoU	48.1	InternImage-T
Semantic Segmentation	ADE20K	Params (M)	1310	InternImage-H (M3I Pre-training)
Object Detection	CrowdHuman (full body)	AP	97.2	InternImage-H
Object Detection	LVIS v1.0 minival	box AP	65.8	InternImage-H
Object Detection	COCO test-dev	Params (M)	2180	InternImage-H (M3I Pre-training)
Object Detection	COCO test-dev	box mAP	65.5	InternImage-H (M3I Pre-training)
Object Detection	COCO test-dev	Params (M)	602	InternImage-XL
Object Detection	COCO test-dev	box mAP	64.3	InternImage-XL
Object Detection	COCO-O	Average mAP	37	InternImage-L (Cascade Mask R-CNN)
Object Detection	COCO-O	Effective Robustness	11.72	InternImage-L (Cascade Mask R-CNN)
Object Detection	OpenImages-v6	box AP	74.1	InternImage-H
Object Detection	PASCAL VOC 2012	MAP	97.2	InternImage-H
Object Detection	COCO minival	box AP	65	InternImage-H
Object Detection	COCO minival	box AP	64.2	InternImage-XL
Object Detection	LVIS v1.0 val	box AP	63.2	InternImage-H
Image Classification	ImageNet	GFLOPs	1478	InternImage-H
Image Classification	ImageNet	GFLOPs	163	InternImage-XL
Image Classification	ImageNet	GFLOPs	108	InternImage-L
Image Classification	ImageNet	GFLOPs	16	InternImage-B
Image Classification	ImageNet	GFLOPs	8	InternImage-S
Instance Segmentation	COCO minival	AP50	80.1	InternImage-H
Instance Segmentation	COCO minival	AP75	61.5	InternImage-H
Instance Segmentation	COCO minival	APL	74.4	InternImage-H
Instance Segmentation	COCO minival	APM	58.4	InternImage-H
Instance Segmentation	COCO minival	APS	37.9	InternImage-H
Instance Segmentation	COCO minival	mask AP	55.4	InternImage-H
Instance Segmentation	COCO minival	GFLOPs	1782	InternImage-XL
Instance Segmentation	COCO minival	Params (M)	387	InternImage-XL
Instance Segmentation	COCO minival	mask AP	48.8	InternImage-XL
Instance Segmentation	COCO minival	GFLOPs	1399	InternImage-L
Instance Segmentation	COCO minival	Params (M)	277	InternImage-L
Instance Segmentation	COCO minival	box AP	56.1	InternImage-L
Instance Segmentation	COCO minival	mask AP	48.5	InternImage-L
Instance Segmentation	COCO minival	GFLOPs	340	InternImage-S
Instance Segmentation	COCO minival	Params (M)	69	InternImage-S
Instance Segmentation	COCO minival	box AP	49.7	InternImage-S
Instance Segmentation	COCO minival	mask AP	44.5	InternImage-S
Instance Segmentation	COCO minival	GFLOPs	270	InternImage-T
Instance Segmentation	COCO minival	Params (M)	49	InternImage-T
Instance Segmentation	COCO minival	box AP	49.1	InternImage-T
Instance Segmentation	COCO minival	mask AP	43.7	InternImage-T
Instance Segmentation	COCO minival	GFLOPs	501	InternImage-B
Instance Segmentation	COCO minival	Params (M)	115	InternImage-B
Instance Segmentation	COCO test-dev	AP50	80.8	InternImage-H
Instance Segmentation	COCO test-dev	AP75	62.2	InternImage-H
Instance Segmentation	COCO test-dev	APL	70.3	InternImage-H
Instance Segmentation	COCO test-dev	APM	58.9	InternImage-H
Instance Segmentation	COCO test-dev	APS	41	InternImage-H
2D Object Detection	BDD100K val	mAP	38.8	InternImage-H
