Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui

Date: 2021-04-28
Venue: ICLR 2022
Tasks: Open Vocabulary Image Classification, Image Classification, Zero-Shot Image Classification, Zero-Shot Object Detection, Open Vocabulary Object Detection, Knowledge Distillation, Object Detection

Abstract

We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP$_r$. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.
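
The two alignment objectives described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not taken from the official code. The loss forms are assumptions made for illustration: a softmax cross-entropy over cosine similarities for the text alignment, and a mean absolute difference between normalized embeddings for the image alignment. The function names and the `temperature` default are hypothetical, and details of a real detector (batches of proposals, a background category, the mask head) are left out.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned as a copy if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def text_alignment_loss(region_emb, text_embs, label, temperature=0.01):
    """Cross-entropy of one student region embedding against the teacher's
    category text embeddings (assumed loss form, hypothetical temperature)."""
    logits = [cosine(region_emb, t) / temperature for t in text_embs]
    m = max(logits)
    # log-sum-exp with the max subtracted, so large logits do not overflow
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]

def image_alignment_loss(region_emb, teacher_image_emb):
    """Mean absolute difference between the normalized student region embedding
    and the teacher's embedding of the same cropped proposal (assumed form)."""
    s, t = l2_normalize(region_emb), l2_normalize(teacher_image_emb)
    return sum(abs(x - y) for x, y in zip(s, t)) / len(s)

def classify_region(region_emb, text_embs, names):
    """Open-vocabulary scoring: return the category whose text embedding is
    most similar to the region embedding. New categories only need new text."""
    sims = [cosine(region_emb, t) for t in text_embs]
    best = max(range(len(names)), key=lambda i: sims[i])
    return names[best]

# Toy 3-dimensional embeddings standing in for teacher outputs (made-up values).
names = ["cat", "dog"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region = [0.9, 0.1, 0.0]

print(classify_region(region, text_embs, names))   # cat
print(text_alignment_loss(region, text_embs, label=0))
print(image_alignment_loss(region, [1.0, 0.0, 0.0]))
```

Because classification reduces to a similarity lookup against text embeddings, extending the vocabulary in this sketch only means appending entries to `names` and `text_embs`; nothing about the region embedding changes.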

Results

Tasks (identical rows are listed under each): Object Detection, 3D, 2D Classification, 2D Object Detection, Open Vocabulary Object Detection, 16k

| Dataset | Metric | Value | Model |
| --- | --- | --- | --- |
| Objects365 | mask AP50 | 18.2 | ViLD |
| LVIS v1.0 | AP novel-LVIS base training | 26.3 | ViLD-ensemble w/ ALIGN (Eb7-FPN) |
| LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 27 | ViLD-ensemble w/ ALIGN (Eb7-FPN) |
| LVIS v1.0 | AP novel-LVIS base training | 18.7 | ViLD-ensemble (R152-FPN) |
| LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 19.8 | ViLD-ensemble (R152-FPN) |
| LVIS v1.0 | AP novel-LVIS base training | 16.6 | ViLD-ensemble (R50-FPN) |
| LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 16.7 | ViLD-ensemble (R50-FPN) |
| LVIS v1.0 | AP novel-LVIS base training | 16.1 | ViLD (R50-FPN) |
| LVIS v1.0 | AP novel-Unrestricted open-vocabulary training | 16.3 | ViLD (R50-FPN) |
| MSCOCO | AP 0.5 | 27.6 | ViLD |
