Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification

Sravanti Addepalli, Ashish Ramayee Asokan, Lakshay Sharma, R. Venkatesh Babu

2023-10-12 · CVPR 2024 · Image Classification · Domain Generalization

Paper · PDF · Code (official)

Abstract

Vision-Language Models (VLMs) such as CLIP are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions. However, in several cases, their expensive training and data collection/curation costs do not justify the end application. This motivates a vendor-client paradigm, where a vendor trains a large-scale VLM and grants only input-output access to clients on a pay-per-query basis in a black-box setting. The client aims to minimize inference cost by distilling the VLM to a student model using the limited available task-specific data, and further deploying this student model in the downstream application. While naive distillation largely improves the In-Domain (ID) accuracy of the student, it fails to transfer the superior out-of-distribution (OOD) generalization of the VLM teacher using the limited available labeled images. To mitigate this, we propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model, and further distills the aligned VLM representations to the student. This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings. The proposed approach achieves state-of-the-art results on the standard Domain Generalization benchmarks in a black-box teacher setting as well as a white-box setting where the weights of the VLM are accessible.
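The align-then-distill idea above can be illustrated with a minimal numpy sketch. This is a hypothetical simplification, not the paper's implementation: it assumes a linear projection `proj` that maps student features into the VLM embedding space, and pulls the projected features toward both the VLM image embeddings and the class text embeddings via cosine distance.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def align_distill_loss(student_feats, vlm_image_embs, vlm_text_embs, proj):
    """Hypothetical sketch of an align-and-distill objective.

    student_feats:  (batch, d_s) features from the pre-trained student
    vlm_image_embs: (batch, d_v) VLM image-encoder embeddings
    vlm_text_embs:  (batch, d_v) VLM text embeddings for each sample's class
    proj:           (d_s, d_v)  linear projection into the VLM space
    """
    z = l2_normalize(student_feats @ proj)   # projected student features
    img = l2_normalize(vlm_image_embs)
    txt = l2_normalize(vlm_text_embs)
    # Cosine-distance terms (1 - cosine similarity), averaged over the batch:
    # one term aligns with the image modality, one with the text modality.
    loss_img = np.mean(1.0 - np.sum(z * img, axis=-1))
    loss_txt = np.mean(1.0 - np.sum(z * txt, axis=-1))
    return loss_img + loss_txt
```

In the black-box setting described in the abstract, the client would query the vendor's VLM for `vlm_image_embs` and `vlm_text_embs`, then minimize this loss over `proj` and the student's weights.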

Results

| Task                  | Dataset        | Metric           | Value | Model                     |
|-----------------------|----------------|------------------|-------|---------------------------|
| Domain Adaptation     | PACS           | Average Accuracy | 96.68 | VL2V-SD (CLIP, ViT-B/16)  |
| Domain Adaptation     | Office-Home    | Average Accuracy | 87.38 | VL2V-SD (CLIP, ViT-B/16)  |
| Domain Adaptation     | DomainNet      | Average Accuracy | 62.79 | VL2V-SD (CLIP, ViT-B/16)  |
| Domain Adaptation     | VLCS           | Average Accuracy | 83.25 | VL2V-SD (CLIP, ViT-B/16)  |
| Domain Adaptation     | TerraIncognita | Average Accuracy | 58.54 | VL2V-SD (CLIP, ViT-B/16)  |
| Domain Generalization | PACS           | Average Accuracy | 96.68 | VL2V-SD (CLIP, ViT-B/16)  |
| Domain Generalization | Office-Home    | Average Accuracy | 87.38 | VL2V-SD (CLIP, ViT-B/16)  |
| Domain Generalization | DomainNet      | Average Accuracy | 62.79 | VL2V-SD (CLIP, ViT-B/16)  |
| Domain Generalization | VLCS           | Average Accuracy | 83.25 | VL2V-SD (CLIP, ViT-B/16)  |
| Domain Generalization | TerraIncognita | Average Accuracy | 58.54 | VL2V-SD (CLIP, ViT-B/16)  |

Related Papers

- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization (2025-07-17)
- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
- MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling (2025-07-17)