A Vision-Language Foundation Model for Leaf Disease Identification

Khang Nguyen Quoc, Lan Le Thi Thu, Luyl-Da Quach

2025-05-11

Image-text Retrieval, Image Classification, Text Retrieval, Contrastive Learning

Abstract

Leaf disease identification plays a pivotal role in smart agriculture. However, many existing studies still struggle to integrate image and textual modalities so that each compensates for the other's limitations. Furthermore, many of these approaches rely on pretraining with constrained datasets such as ImageNet, which lack domain-specific information. We propose SCOLD (Soft-target COntrastive learning for Leaf Disease identification), a context-aware vision-language foundation model tailored to address these challenges in agricultural tasks. SCOLD is developed using a diverse corpus of plant leaf images and corresponding symptom descriptions, comprising over 186,000 image-caption pairs aligned with 97 unique concepts. Through task-agnostic pretraining, SCOLD uses contextual soft targets to smooth the labels in contrastive learning, which mitigates overconfidence and improves generalization and robustness on fine-grained classification tasks. Experimental results demonstrate that SCOLD outperforms existing vision-language models such as OpenAI-CLIP-L, BioCLIP, and SigLIP2 across several benchmarks, including zero-shot and few-shot classification, image-text retrieval, and image classification, while maintaining a competitive parameter footprint. Ablation studies further highlight SCOLD's effectiveness relative to its counterparts. The proposed approach significantly advances vision-language foundation models for agriculture, offering strong performance with minimal or no supervised fine-tuning. This work lays a solid groundwork for future research on models trained with long-form and simplified contexts, tasks involving class ambiguity, and multi-modal systems for intelligent plant disease diagnostics. The code for this study is available at https://huggingface.co/enalis/scold
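
To make the idea of soft targets in contrastive learning concrete, the following is a minimal sketch and not the SCOLD implementation. It assumes a CLIP-style symmetric loss in which the usual one-hot pairing targets are blended with a softmax over caption-to-caption similarity. The function names, the blending weight `alpha`, and the `temperature` value are all illustrative assumptions; the abstract does not specify how SCOLD constructs its targets.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def soft_targets(text_emb, alpha=0.1, temperature=0.07):
    """Blend one-hot pairing targets with caption-caption similarity.

    Assumption: `alpha` is the share of probability mass moved from the
    true pair to contextually similar captions. Each row sums to 1.
    """
    txt = l2_normalize(text_emb)
    context = np.exp(log_softmax(txt @ txt.T / temperature, axis=-1))
    hard = np.eye(txt.shape[0])
    return (1.0 - alpha) * hard + alpha * context


def soft_contrastive_loss(image_emb, text_emb, alpha=0.1, temperature=0.07):
    """Symmetric image-text cross-entropy against soft targets."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # rows: images, columns: captions
    targets = soft_targets(text_emb, alpha, temperature)
    # Image-to-text: each image row is scored against all captions.
    loss_i2t = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    # Text-to-image: each caption row is scored against all images.
    loss_t2i = -(targets * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

With `alpha=0` the targets reduce to the identity matrix and the loss becomes the standard hard-label contrastive objective. A positive `alpha` stops the model from being pushed toward full confidence on a single pairing when several captions describe similar symptoms.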

Results

Task	Dataset	Metric	Value	Model
Image Classification	LeafNet	Accuracy (Top-1)	95.49	SCOLD
Image Classification	PlantVillage	Accuracy	99.96	SCOLD
Image Classification	PlantVillage	F1	99.95	SCOLD
Image Classification	PlantDoc	Accuracy	99.69	SCOLD
