Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


A Simple Long-Tailed Recognition Baseline via Vision-Language Model

Teli Ma, Shijie Geng, Mengmeng Wang, Jing Shao, Jiasen Lu, Hongsheng Li, Peng Gao, Yu Qiao

2021-11-29
Long-tail Learning, Semantic Similarity, Semantic Textual Similarity, Contrastive Learning, Language Modelling

Abstract

The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems. Existing approaches either apply class re-balancing strategies or directly improve network modules to address the problem. However, they still train models with a finite set of predefined labels, limiting their supervision information and restricting their transferability to novel instances. Recent advances in large-scale contrastive visual-language pretraining shed light on a new pathway for visual recognition. With open-vocabulary supervision, pretrained contrastive vision-language models learn powerful multimodal representations that show promise for handling data deficiency and unseen concepts. By calculating the semantic similarity between visual and text inputs, visual recognition is converted to a vision-language matching problem. Inspired by this, we propose BALLAD to leverage contrastive vision-language models for long-tailed recognition. We first continue pretraining the vision-language backbone through contrastive learning on a specific long-tailed target dataset. Afterward, we freeze the backbone and further employ an additional adapter layer to enhance the representations of tail classes on balanced training samples built with re-sampling strategies. Extensive experiments have been conducted on three popular long-tailed recognition benchmarks. As a result, our simple and effective approach sets new state-of-the-art performance and outperforms competitive baselines by a large margin. Code is released at https://github.com/gaopengcuhk/BALLAD.
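The abstract describes two ideas that a small sketch may make more concrete: recognition as vision-language matching (scoring an image embedding against one text embedding per class) and building balanced training samples by re-sampling. The following is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names, the toy 2-D embeddings, the temperature value of 0.01, and the "pick a class uniformly, then an example uniformly" sampler are all hypothetical choices made for this example. The official code is at the repository linked in the abstract.

```python
import math
import random
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_image_to_text(image_emb, text_embs, temperature=0.01):
    """Classify an image by matching it against per-class text embeddings.

    text_embs maps a class name to the embedding of its text prompt.
    The temperature is an assumed value, not one taken from the paper.
    Returns (predicted class name, dict of class probabilities).
    """
    names = list(text_embs)
    logits = [cosine(image_emb, text_embs[n]) / temperature for n in names]
    top = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = {n: e / total for n, e in zip(names, exps)}
    return max(probs, key=probs.get), probs


def class_balanced_sample(labels, num_samples, rng):
    """Draw example indices so that every class is equally likely.

    A class is picked uniformly first, then one of its examples uniformly,
    so tail classes are drawn as often as head classes.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(num_samples)]


# Toy usage with made-up 2-D embeddings (hypothetical data).
text_embs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
pred, probs = match_image_to_text([0.9, 0.1], text_embs)

# A long-tailed label list: 98 "head" examples and 2 "tail" examples.
labels = ["head"] * 98 + ["tail"] * 2
drawn = class_balanced_sample(labels, 1000, random.Random(0))
tail_share = sum(labels[i] == "tail" for i in drawn) / len(drawn)
```

In this toy setup the tail class makes up 2% of the data but roughly half of the drawn samples, which is the kind of balance a second training stage on re-sampled data would see.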

Results

Task                                 Dataset               Metric          Value  Model
Image Classification                 Places-LT             Top-1 Accuracy  49.5   BALLAD(ViT-B-16)
Image Classification                 Places-LT             Top-1 Accuracy  49.3   BALLAD(ResNet-50×16)
Image Classification                 Places-LT             Top-1 Accuracy  47.9   BALLAD(ResNet-101)
Image Classification                 Places-LT             Top-1 Accuracy  46.5   BALLAD(ResNet-50)
Image Classification                 ImageNet-LT           Top-1 Accuracy  76.5   BALLAD(ResNet-50×16)
Image Classification                 ImageNet-LT           Top-1 Accuracy  75.7   BALLAD(ViT-B-16)
Image Classification                 ImageNet-LT           Top-1 Accuracy  70.5   BALLAD(ResNet-101)
Image Classification                 ImageNet-LT           Top-1 Accuracy  67.2   BALLAD(ResNet-50)
Image Classification                 CIFAR-100-LT (ρ=100)  Error Rate      22.2   BALLAD (ViT-B/16)
Few-Shot Image Classification        Places-LT             Top-1 Accuracy  49.5   BALLAD(ViT-B-16)
Few-Shot Image Classification        Places-LT             Top-1 Accuracy  49.3   BALLAD(ResNet-50×16)
Few-Shot Image Classification        Places-LT             Top-1 Accuracy  47.9   BALLAD(ResNet-101)
Few-Shot Image Classification        Places-LT             Top-1 Accuracy  46.5   BALLAD(ResNet-50)
Few-Shot Image Classification        ImageNet-LT           Top-1 Accuracy  76.5   BALLAD(ResNet-50×16)
Few-Shot Image Classification        ImageNet-LT           Top-1 Accuracy  75.7   BALLAD(ViT-B-16)
Few-Shot Image Classification        ImageNet-LT           Top-1 Accuracy  70.5   BALLAD(ResNet-101)
Few-Shot Image Classification        ImageNet-LT           Top-1 Accuracy  67.2   BALLAD(ResNet-50)
Few-Shot Image Classification        CIFAR-100-LT (ρ=100)  Error Rate      22.2   BALLAD (ViT-B/16)
Generalized Few-Shot Classification  Places-LT             Top-1 Accuracy  49.5   BALLAD(ViT-B-16)
Generalized Few-Shot Classification  Places-LT             Top-1 Accuracy  49.3   BALLAD(ResNet-50×16)
Generalized Few-Shot Classification  Places-LT             Top-1 Accuracy  47.9   BALLAD(ResNet-101)
Generalized Few-Shot Classification  Places-LT             Top-1 Accuracy  46.5   BALLAD(ResNet-50)
Generalized Few-Shot Classification  ImageNet-LT           Top-1 Accuracy  76.5   BALLAD(ResNet-50×16)
Generalized Few-Shot Classification  ImageNet-LT           Top-1 Accuracy  75.7   BALLAD(ViT-B-16)
Generalized Few-Shot Classification  ImageNet-LT           Top-1 Accuracy  70.5   BALLAD(ResNet-101)
Generalized Few-Shot Classification  ImageNet-LT           Top-1 Accuracy  67.2   BALLAD(ResNet-50)
Generalized Few-Shot Classification  CIFAR-100-LT (ρ=100)  Error Rate      22.2   BALLAD (ViT-B/16)
Long-tail Learning                   Places-LT             Top-1 Accuracy  49.5   BALLAD(ViT-B-16)
Long-tail Learning                   Places-LT             Top-1 Accuracy  49.3   BALLAD(ResNet-50×16)
Long-tail Learning                   Places-LT             Top-1 Accuracy  47.9   BALLAD(ResNet-101)
Long-tail Learning                   Places-LT             Top-1 Accuracy  46.5   BALLAD(ResNet-50)
Long-tail Learning                   ImageNet-LT           Top-1 Accuracy  76.5   BALLAD(ResNet-50×16)
Long-tail Learning                   ImageNet-LT           Top-1 Accuracy  75.7   BALLAD(ViT-B-16)
Long-tail Learning                   ImageNet-LT           Top-1 Accuracy  70.5   BALLAD(ResNet-101)
Long-tail Learning                   ImageNet-LT           Top-1 Accuracy  67.2   BALLAD(ResNet-50)
Long-tail Learning                   CIFAR-100-LT (ρ=100)  Error Rate      22.2   BALLAD (ViT-B/16)
Generalized Few-Shot Learning        Places-LT             Top-1 Accuracy  49.5   BALLAD(ViT-B-16)
Generalized Few-Shot Learning        Places-LT             Top-1 Accuracy  49.3   BALLAD(ResNet-50×16)
Generalized Few-Shot Learning        Places-LT             Top-1 Accuracy  47.9   BALLAD(ResNet-101)
Generalized Few-Shot Learning        Places-LT             Top-1 Accuracy  46.5   BALLAD(ResNet-50)
Generalized Few-Shot Learning        ImageNet-LT           Top-1 Accuracy  76.5   BALLAD(ResNet-50×16)
Generalized Few-Shot Learning        ImageNet-LT           Top-1 Accuracy  75.7   BALLAD(ViT-B-16)
Generalized Few-Shot Learning        ImageNet-LT           Top-1 Accuracy  70.5   BALLAD(ResNet-101)
Generalized Few-Shot Learning        ImageNet-LT           Top-1 Accuracy  67.2   BALLAD(ResNet-50)
Generalized Few-Shot Learning        CIFAR-100-LT (ρ=100)  Error Rate      22.2   BALLAD (ViT-B/16)
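
The results above mix two metrics: Top-1 Accuracy (higher is better) for Places-LT and ImageNet-LT, and Error Rate (lower is better) for CIFAR-100-LT. As a rough sketch of how the two relate, assuming the reported error rate is a top-1 error expressed as a percentage (the page does not state this), both can be computed from the same predictions. The function names and the toy labels below are hypothetical.

```python
def top1_accuracy(predictions, targets):
    """Percentage of examples whose predicted label equals the target label."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equal in length")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return 100.0 * correct / len(targets)


def error_rate(predictions, targets):
    """Top-1 error in percent, i.e. the complement of top-1 accuracy."""
    return 100.0 - top1_accuracy(predictions, targets)


# Toy usage with made-up labels: three of four predictions are correct.
acc = top1_accuracy(["a", "b", "c", "d"], ["a", "b", "c", "x"])
err = error_rate(["a", "b", "c", "d"], ["a", "b", "c", "x"])
```

Under that assumption, an error rate of 22.2 would correspond to a top-1 accuracy of 77.8, which puts the CIFAR-100-LT row on the same scale as the other rows.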

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)