Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer

Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Bo Ren, Shu-Tao Xia

2022-07-05

Image-text matching, Text Matching, Transfer Learning, Knowledge Distillation, object-detection, Multi-Label Classification, Zero-Shot Learning, Multi-label zero-shot learning, Object Detection, Language Modelling


Abstract

Real-world recognition systems often encounter unseen labels. To identify such labels, multi-label zero-shot learning (ML-ZSL) transfers knowledge through a pre-trained textual label embedding (e.g., GloVe). However, such methods exploit only single-modal knowledge from a language model and ignore the rich semantic information inherent in image-text pairs. In contrast, recently developed open-vocabulary (OV) methods succeed in exploiting this information in object detection and achieve impressive performance. Inspired by their success, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits the multi-modal knowledge of image-text pairs through a vision and language pre-training (VLP) model. To transfer the image-text matching ability of the VLP model, knowledge distillation is employed to keep the image and label embeddings consistent, along with prompt tuning to further update the label embeddings. To enable the recognition of multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets. The source code is available at https://github.com/sunanhe/MKT.
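The abstract names three ingredients: matching image features against label embeddings, distilling from a VLP image encoder, and a two-stream module that combines global and local features. The sketch below illustrates how these pieces could fit together in plain Python. It is not the authors' implementation: the cosine-similarity scoring, the top-k averaging over patch features, the equal-weight fusion of the two streams, the L1 distillation loss, and all function and parameter names are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_scores(image_embedding, label_embeddings):
    """Global stream: match one whole-image embedding against every label."""
    return [cosine(image_embedding, label) for label in label_embeddings]


def local_scores(patch_embeddings, label_embeddings, top_k=2):
    """Local stream: for each label, average its top_k best-matching patches.

    The top-k mean is an assumed aggregation, not taken from the abstract.
    """
    scores = []
    for label in label_embeddings:
        sims = sorted((cosine(p, label) for p in patch_embeddings), reverse=True)
        top = sims[:top_k]
        scores.append(sum(top) / len(top))
    return scores


def two_stream_scores(image_embedding, patch_embeddings, label_embeddings, top_k=2):
    """Fuse both streams with an (assumed) equal-weight average."""
    g = global_scores(image_embedding, label_embeddings)
    l = local_scores(patch_embeddings, label_embeddings, top_k)
    return [(a + b) / 2 for a, b in zip(g, l)]


def distillation_loss(student_embedding, teacher_embedding):
    """L1 distance between student and teacher image embeddings (assumed loss)."""
    return sum(abs(s - t) for s, t in zip(student_embedding, teacher_embedding))


# Toy data: 2-D embeddings for two labels, one image, and three patches.
labels = [[1.0, 0.0], [0.0, 1.0]]
image = [1.0, 0.1]
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

scores = two_stream_scores(image, patches, labels)

# An unseen label only needs a new embedding; no classifier weights change.
scores_with_new_label = two_stream_scores(image, patches, labels + [[0.7, 0.7]])
```

Because labels enter only as embeddings, extending the vocabulary at inference time amounts to appending another vector, which is the property that makes the setting open-vocabulary. In a real system the embeddings would come from the VLP model's text and image encoders rather than hand-written lists.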

Results

Task	Dataset	Metric	Value	Model
Zero-Shot Learning	Open Images V4	mAP	89.2	MKT(IN-1K)
Zero-Shot Learning	NUS-WIDE	mAP	42.7	MKT(CLIP)
Zero-Shot Learning	NUS-WIDE	mAP	37.6	MKT(IN-1K)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)