VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

Changyao Tian, Wenhai Wang, Xizhou Zhu, Jifeng Dai, Yu Qiao

2021-11-26 | Image Classification | Long-tail Learning | Transfer Learning

Abstract

Deep learning-based models encounter challenges when processing long-tailed data in the real world. Existing solutions usually address the class imbalance problem with balancing strategies or transfer learning, and rely on the image modality. In this work, we present a visual-linguistic long-tailed recognition framework, termed VL-LTR, and conduct empirical studies on the benefits of introducing the text modality for long-tailed recognition (LTR). Compared to existing approaches, the proposed VL-LTR has the following merits. (1) Our method learns not only visual representations from images but also the corresponding linguistic representations from noisy class-level text descriptions collected from the Internet; (2) Our method can effectively use the learned visual-linguistic representations to improve visual recognition performance, especially for classes with fewer image samples. We also conduct extensive experiments and set a new state of the art on widely used LTR benchmarks. Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, outperforming the previous best method by over 17 points and approaching the prevailing performance of models trained on the full ImageNet. Code is available at https://github.com/ChangyaoTian/VL-LTR.

Results

Task	Dataset	Metric	Value (%)	Model
Image Classification	Places-LT	Top-1 Accuracy	50.1	VL-LTR (ViT-B-16)
Image Classification	Places-LT	Top-1 Accuracy	48	VL-LTR (ResNet-50)
Image Classification	ImageNet-LT	Top-1 Accuracy	77.2	VL-LTR (ViT-B-16)
Image Classification	ImageNet-LT	Top-1 Accuracy	70.1	VL-LTR (ResNet-50)
Few-Shot Image Classification	Places-LT	Top-1 Accuracy	50.1	VL-LTR (ViT-B-16)
Few-Shot Image Classification	Places-LT	Top-1 Accuracy	48	VL-LTR (ResNet-50)
Few-Shot Image Classification	ImageNet-LT	Top-1 Accuracy	77.2	VL-LTR (ViT-B-16)
Few-Shot Image Classification	ImageNet-LT	Top-1 Accuracy	70.1	VL-LTR (ResNet-50)
Generalized Few-Shot Classification	Places-LT	Top-1 Accuracy	50.1	VL-LTR (ViT-B-16)
Generalized Few-Shot Classification	Places-LT	Top-1 Accuracy	48	VL-LTR (ResNet-50)
Generalized Few-Shot Classification	ImageNet-LT	Top-1 Accuracy	77.2	VL-LTR (ViT-B-16)
Generalized Few-Shot Classification	ImageNet-LT	Top-1 Accuracy	70.1	VL-LTR (ResNet-50)
Long-tail Learning	Places-LT	Top-1 Accuracy	50.1	VL-LTR (ViT-B-16)
Long-tail Learning	Places-LT	Top-1 Accuracy	48	VL-LTR (ResNet-50)
Long-tail Learning	ImageNet-LT	Top-1 Accuracy	77.2	VL-LTR (ViT-B-16)
Long-tail Learning	ImageNet-LT	Top-1 Accuracy	70.1	VL-LTR (ResNet-50)
Generalized Few-Shot Learning	Places-LT	Top-1 Accuracy	50.1	VL-LTR (ViT-B-16)
Generalized Few-Shot Learning	Places-LT	Top-1 Accuracy	48	VL-LTR (ResNet-50)
Generalized Few-Shot Learning	ImageNet-LT	Top-1 Accuracy	77.2	VL-LTR (ViT-B-16)
Generalized Few-Shot Learning	ImageNet-LT	Top-1 Accuracy	70.1	VL-LTR (ResNet-50)
