Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui

2021-04-28, ICLR 2022

Tasks: Open Vocabulary Image Classification, Image Classification, Zero-Shot Image Classification, Zero-Shot Object Detection, Open Vocabulary Object Detection, Knowledge Distillation, Object Detection


Abstract

We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP$_r$. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
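The two alignment objectives described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the temperature value, the use of cosine similarity with softmax cross-entropy for text alignment, and the L1 distance for image distillation are all assumptions made for illustration, and random vectors stand in for real teacher and student embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def text_alignment_loss(region_emb, text_emb, labels, temperature=0.01):
    """Cross-entropy over cosine similarities between student region
    embeddings and teacher text embeddings (hypothetical formulation)."""
    logits = l2_normalize(region_emb) @ l2_normalize(text_emb).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def image_distillation_loss(region_emb, teacher_image_emb):
    """Mean L1 distance between normalised student region embeddings and
    teacher image embeddings of the same proposals (assumed loss choice)."""
    diff = l2_normalize(region_emb) - l2_normalize(teacher_image_emb)
    return float(np.abs(diff).mean())

def classify_regions(region_emb, text_emb):
    """Open-vocabulary inference: pick the category whose text embedding is
    most similar to each region embedding. New categories only require new
    rows in text_emb, not retraining."""
    return (l2_normalize(region_emb) @ l2_normalize(text_emb).T).argmax(axis=1)

# Toy data: 5 category text embeddings and 3 proposals, 16-dim (all synthetic).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 16))
labels = np.array([0, 3, 4])
teacher_image_emb = text_emb[labels] + 0.05 * rng.normal(size=(3, 16))
region_emb = teacher_image_emb.copy()  # a perfectly distilled student

loss_text = text_alignment_loss(region_emb, text_emb, labels)
loss_image = image_distillation_loss(region_emb, teacher_image_emb)
predictions = classify_regions(region_emb, text_emb)
```

In this toy setup the student embeddings equal the teacher image embeddings, so the distillation term is zero; a real detector would minimise both terms jointly alongside its usual box and mask losses.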

Results

Task	Dataset	Metric	Value	Model
Object Detection	Objects365	mask AP50	18.2	ViLD
Object Detection	LVIS v1.0	AP novel-LVIS base training	26.3	ViLD-ensemble w/ ALIGN (Eb7-FPN)
Object Detection	LVIS v1.0	AP novel-Unrestricted open-vocabulary training	27	ViLD-ensemble w/ ALIGN (Eb7-FPN)
Object Detection	LVIS v1.0	AP novel-LVIS base training	18.7	ViLD-ensemble (R152-FPN)
Object Detection	LVIS v1.0	AP novel-Unrestricted open-vocabulary training	19.8	ViLD-ensemble (R152-FPN)
Object Detection	LVIS v1.0	AP novel-LVIS base training	16.6	ViLD-ensemble (R50-FPN)
Object Detection	LVIS v1.0	AP novel-Unrestricted open-vocabulary training	16.7	ViLD-ensemble (R50-FPN)
Object Detection	LVIS v1.0	AP novel-LVIS base training	16.1	ViLD (R50-FPN)
Object Detection	LVIS v1.0	AP novel-Unrestricted open-vocabulary training	16.3	ViLD (R50-FPN)
Object Detection	MSCOCO	AP 0.5	27.6	ViLD
