Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification

Karim El Khoury, Maxime Zanella, Benoît Gérin, Tiffanie Godelaine, Benoît Macq, Saïd Mahmoudi, Christophe De Vleeschouwer, Ismail Ben Ayed

2024-09-01 · Scene Classification · Transductive Zero-Shot Classification · Zero-Shot Learning

Paper · PDF · Code (official)

Abstract

Vision-Language Models for remote sensing have shown promising use thanks to their extensive pretraining. However, their conventional usage in zero-shot scene classification methods still involves dividing large images into patches and making independent predictions, i.e., inductive inference, thereby limiting their effectiveness by ignoring valuable contextual information. Our approach tackles this issue by utilizing initial predictions based on text prompting and patch affinity relationships from the image encoder to enhance zero-shot capabilities through transductive inference, all without the need for supervision and at a minor computational cost. Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements over inductive zero-shot classification. Our source code is publicly available on GitHub: https://github.com/elkhouryk/RS-TransCLIP
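The abstract's core idea (initial text-prompt predictions refined via patch affinities, with no supervision) can be illustrated with a generic transductive label-propagation sketch. This is not the authors' RS-TransCLIP objective, only a minimal stand-in under assumed inputs: L2-normalized patch embeddings `image_feats` and class-prompt embeddings `text_feats`; the function name, the top-k affinity construction, and the blending weight `alpha` are all illustrative choices.

```python
import numpy as np

def transductive_zero_shot(image_feats, text_feats, alpha=0.5, n_iters=10, k=5):
    """Refine per-patch zero-shot predictions with label propagation (a sketch,
    not the RS-TransCLIP algorithm).

    image_feats: (N, d) L2-normalized patch embeddings
    text_feats:  (C, d) L2-normalized class-prompt embeddings
    Returns refined class probabilities of shape (N, C).
    """
    # Inductive starting point: CLIP-style text-prompt predictions per patch.
    logits = 100.0 * image_feats @ text_feats.T
    z0 = np.exp(logits - logits.max(axis=1, keepdims=True))
    z0 /= z0.sum(axis=1, keepdims=True)

    # Patch affinities from the image encoder: keep top-k cosine neighbours.
    sim = image_feats @ image_feats.T
    np.fill_diagonal(sim, -np.inf)                # exclude self-affinity
    idx = np.argpartition(-sim, k, axis=1)[:, :k]
    w = np.zeros_like(sim)
    rows = np.arange(sim.shape[0])[:, None]
    w[rows, idx] = np.clip(sim[rows, idx], 0.0, None)
    w = 0.5 * (w + w.T)                           # symmetrize the graph
    w /= w.sum(axis=1, keepdims=True) + 1e-8      # row-normalize

    # Transductive refinement: blend neighbour consensus with the text prior.
    z = z0.copy()
    for _ in range(n_iters):
        z = alpha * (w @ z) + (1.0 - alpha) * z0
    return z
```

Because every patch's prediction is updated jointly from its neighbours, contextually similar patches converge toward consistent labels, which is the intuition behind the transductive gains the abstract reports.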

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Zero-Shot Learning | EuroSAT | Accuracy | 91.2 | RS-TransCLIP |
| Zero-Shot Learning | RSICB256 | Accuracy | 72.8 | RS-TransCLIP |
| Zero-Shot Learning | OPTIMAL31 | Accuracy | 94.5 | RS-TransCLIP |
| Zero-Shot Learning | WHURS19 | Accuracy | 99.7 | RS-TransCLIP |
| Zero-Shot Learning | PatternNet | Accuracy | 96.2 | RS-TransCLIP |
| Zero-Shot Learning | RESISC45 | Accuracy | 88 | RS-TransCLIP |
| Zero-Shot Learning | AID | Accuracy | 92.7 | RS-TransCLIP |
| Zero-Shot Learning | MLRSNet | Accuracy | 78.1 | RS-TransCLIP |
| Zero-Shot Learning | RSC11 | Accuracy | 88.1 | RS-TransCLIP |
| Zero-Shot Learning | RSICB128 | Accuracy | 54.8 | RS-TransCLIP |

Related Papers

- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
- DEARLi: Decoupled Enhancement of Recognition and Localization for Semi-supervised Panoptic Segmentation (2025-07-14)
- EVA: Mixture-of-Experts Semantic Variant Alignment for Compositional Zero-Shot Learning (2025-06-26)
- Zero-Shot Learning for Obsolescence Risk Forecasting (2025-06-26)
- Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition (2025-06-25)
- SEZ-HARN: Self-Explainable Zero-shot Human Activity Recognition Network (2025-06-25)
- A Multi-Scale Spatial Attention-Based Zero-Shot Learning Framework for Low-Light Image Enhancement (2025-06-23)
- Generalizable Agent Modeling for Agent Collaboration-Competition Adaptation with Multi-Retrieval and Dynamic Generation (2025-06-20)