Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection

Feng Liu, Xiaosong Zhang, Zhiliang Peng, Zonghao Guo, Fang Wan, Xiangyang Ji, Qixiang Ye

Published: 2022-05-19 · ICCV 2023
Tasks: Few-Shot Object Detection, Representation Learning, Object Detection
Links: Paper · PDF · Code (official)

Abstract

Modern object detectors have taken advantage of backbone networks pre-trained on large-scale datasets. Other than the backbone, however, components such as the detector head and the feature pyramid network (FPN) remain trained from scratch, which hinders fully tapping the potential of representation models. In this study, we propose to integrally migrate pre-trained transformer encoder-decoders (imTED) to a detector, constructing a feature extraction path that is ``fully pre-trained" so that the detector's generalization capacity is maximized. The essential differences between imTED and the baseline detector are twofold: (1) migrating the pre-trained transformer decoder to the detector head while removing the randomly initialized FPN from the feature extraction path; and (2) defining a multi-scale feature modulator (MFM) to enhance scale adaptability. These designs not only significantly reduce the number of randomly initialized parameters but also unify detector training with representation learning. Experiments on the MS COCO object detection dataset show that imTED consistently outperforms its counterparts by $\sim$2.4 AP. Without bells and whistles, imTED improves the state of the art of few-shot object detection by up to 7.6 AP. Code is available at https://github.com/LiewFeng/imTED.
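The abstract's feature-extraction path can be sketched as a minimal pipeline: the pre-trained encoder serves as the backbone, RoI features are extracted directly from its output (no FPN in between), a multi-scale modulator blends in global context, and the pre-trained decoder acts as the detector head. The sketch below is an illustrative assumption, not the paper's implementation; all shapes, the toy RoI sampler, and the averaging-based modulator are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(patches):
    # Stand-in for a pre-trained ViT encoder: returns token features.
    return patches  # (num_patches, dim)

def roi_align(features, num_rois, tokens_per_roi):
    # Toy RoI extraction: sample a token subset per region proposal.
    idx = rng.integers(0, features.shape[0], (num_rois, tokens_per_roi))
    return features[idx]  # (num_rois, tokens_per_roi, dim)

def mfm(roi_feats, global_feats):
    # Multi-scale feature modulator (illustrative): blend per-RoI tokens
    # with a pooled global context vector for scale adaptability.
    context = global_feats.mean(axis=0)  # (dim,)
    return roi_feats + 0.5 * context     # broadcast over RoI tokens

def decoder_head(roi_feats):
    # Stand-in for the migrated pre-trained decoder used as detector head:
    # pool each RoI's tokens into one representation for box/class branches.
    return roi_feats.mean(axis=1)        # (num_rois, dim)

patches = rng.standard_normal((196, 768))  # 14x14 patches, ViT-B width
feats = encoder(patches)
rois = roi_align(feats, num_rois=4, tokens_per_roi=49)
modulated = mfm(rois, feats)
out = decoder_head(modulated)
print(out.shape)  # (4, 768)
```

Note that, unlike a standard detector, no randomly initialized FPN or head appears anywhere on this path: every learned stage is a stand-in for a pre-trained module.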

Results

Task                      | Dataset           | Metric | Value | Model
Few-Shot Object Detection | MS-COCO (30-shot) | AP     | 30.2  | imTED + ViT-B
Few-Shot Object Detection | MS-COCO (30-shot) | AP     | 21    | imTED + ViT-S
Few-Shot Object Detection | MS-COCO (10-shot) | AP     | 22.5  | imTED + ViT-B
Few-Shot Object Detection | MS-COCO (10-shot) | AP     | 15    | imTED + ViT-S

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
- Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
- Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
- Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)