Zhuo Chen, Yufeng Huang, Jiaoyan Chen, Yuxia Geng, Wen Zhang, Yin Fang, Jeff Z. Pan, Huajun Chen
Zero-shot learning (ZSL) aims to predict unseen classes whose samples never appear during training. One of the most effective and widely used forms of semantic information for zero-shot image classification is attributes, i.e., annotations of class-level visual characteristics. However, current methods often fail to discriminate subtle visual distinctions between images, due not only to the shortage of fine-grained annotations but also to attribute imbalance and co-occurrence. In this paper, we present a transformer-based end-to-end ZSL method named DUET, which integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from images; (2) apply an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics against attribute co-occurrence and imbalance; (3) propose a multi-task learning policy for considering multi-modal objectives. We find that DUET achieves state-of-the-art performance on three standard ZSL benchmarks and a knowledge-graph-equipped ZSL benchmark, that its components are effective, and that its predictions are interpretable.
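The attribute-level contrastive step can be illustrated with a generic InfoNCE-style loss: an anchor attribute embedding is pulled toward a positive embedding of the same attribute and pushed away from embeddings of co-occurring attributes. This is a minimal sketch under plain InfoNCE assumptions; the function name, the cosine-similarity choice, and the temperature value are illustrative, not DUET's exact formulation.

```python
import numpy as np

def attribute_info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE-style contrastive loss for one attribute embedding.

    anchor, positive: (d,) vectors representing the same attribute;
    negatives: (k, d) embeddings of other (e.g. co-occurring) attributes.
    Returns a scalar loss that is small when the anchor is closer to the
    positive than to every negative.
    """
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))
```

Minimizing this loss per attribute, rather than per class, is what lets the model separate attributes that frequently co-occur in the same images.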
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Zero-Shot Learning | CUB-200-2011 | Accuracy (Seen) | 72.8 | DUET |
| Zero-Shot Learning | CUB-200-2011 | Accuracy (Unseen) | 62.9 | DUET |
| Zero-Shot Learning | CUB-200-2011 | Harmonic mean (H) | 67.5 | DUET |
| Zero-Shot Learning | CUB-200-2011 | Average top-1 classification accuracy | 72.3 | DUET |
| Zero-Shot Learning | AwA2 | Accuracy (Seen) | 84.7 | DUET |
| Zero-Shot Learning | AwA2 | Accuracy (Unseen) | 63.7 | DUET |
| Zero-Shot Learning | AwA2 | Harmonic mean (H) | 72.7 | DUET |
| Zero-Shot Learning | AwA2 | Average top-1 classification accuracy | 69.9 | DUET |
| Zero-Shot Learning | SUN Attribute | Accuracy (Seen) | 45.8 | DUET |
| Zero-Shot Learning | SUN Attribute | Accuracy (Unseen) | 45.7 | DUET |
| Zero-Shot Learning | SUN Attribute | Harmonic mean (H) | 45.8 | DUET |
| Zero-Shot Learning | SUN Attribute | Average top-1 classification accuracy | 64.4 | DUET |
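The H rows are the standard generalized ZSL metric: the harmonic mean of seen-class and unseen-class accuracy, H = 2·S·U / (S + U). A quick check reproduces the CUB-200-2011 row from its seen/unseen values:

```python
def harmonic_mean(seen, unseen):
    # H = 2 * S * U / (S + U), the generalized ZSL harmonic-mean metric
    return 2 * seen * unseen / (seen + unseen)

# CUB-200-2011: S = 72.8, U = 62.9 from the table above
print(round(harmonic_mean(72.8, 62.9), 1))  # → 67.5
```

The AwA2 row checks out the same way (S = 84.7, U = 63.7 gives H ≈ 72.7); small discrepancies against reported values can arise when H is computed from unrounded accuracies.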