Alexander C. Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, Deepak Pathak
The recent wave of large-scale text-to-image diffusion models has dramatically increased our text-based image generation abilities. These models can generate realistic images for a staggering variety of prompts and exhibit impressive compositional generalization abilities. Almost all use cases thus far have solely focused on sampling; however, diffusion models can also provide conditional density estimates, which are useful for tasks beyond image generation. In this paper, we show that the density estimates from large-scale text-to-image diffusion models like Stable Diffusion can be leveraged to perform zero-shot classification without any additional training. Our generative approach to classification, which we call Diffusion Classifier, attains strong results on a variety of benchmarks and outperforms alternative methods of extracting knowledge from diffusion models. Although a gap remains between generative and discriminative approaches on zero-shot recognition tasks, our diffusion-based approach has significantly stronger multimodal compositional reasoning ability than competing discriminative approaches. Finally, we use Diffusion Classifier to extract standard classifiers from class-conditional diffusion models trained on ImageNet. Our models achieve strong classification performance using only weak augmentations and exhibit qualitatively better "effective robustness" to distribution shift. Overall, our results are a step toward using generative over discriminative models for downstream tasks. Results and visualizations at https://diffusion-classifier.github.io/
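The core decision rule described above — scoring each candidate class by how well the class-conditioned denoiser predicts injected noise, then taking the argmin — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `toy_eps_model` is a hypothetical stand-in for a trained text- or class-conditioned denoiser, and the linear noise schedule is an assumption made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_eps_model(x_t, t, c, true_eps):
    # Hypothetical stand-in for a trained noise predictor eps_theta(x_t, t, c):
    # here, class 0's conditioning recovers the injected noise exactly, while
    # other classes return noisy guesses. A real model would be e.g. Stable
    # Diffusion's UNet conditioned on a class-name prompt.
    if c == 0:
        return true_eps
    return true_eps + rng.normal(scale=1.0, size=true_eps.shape)

def diffusion_classify(x0, num_classes, n_trials=64):
    """Return argmin_c of a Monte Carlo estimate of
    E_{t, eps}[ ||eps - eps_theta(x_t, t, c)||^2 ]."""
    errors = np.zeros(num_classes)
    for _ in range(n_trials):
        t = rng.uniform(0.0, 1.0)           # sampled noise level
        alpha_bar = 1.0 - t                 # toy linear schedule (assumption)
        eps = rng.normal(size=x0.shape)
        x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
        for c in range(num_classes):
            pred = toy_eps_model(x_t, t, c, eps)
            errors[c] += np.mean((eps - pred) ** 2)
    return int(np.argmin(errors))

x0 = rng.normal(size=(8, 8))
print(diffusion_classify(x0, num_classes=3))  # → 0 (its conditioning fits best)
```

The same rule scales to real prompts by replacing the toy denoiser with a pretrained conditional diffusion model and averaging the noise-prediction error over shared (t, eps) samples for each class.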
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Domain Generalization | ImageNet-A | Top-1 Accuracy (%) | 30.2 | Diffusion Classifier |
| Visual Reasoning | Winoground | Text Score | 34 | Diffusion Classifier (zero-shot) |
| Image Classification | CIFAR-10 | Accuracy (%) | 88.5 | Diffusion Classifier (zero-shot) |
| Image Classification | Oxford-IIIT Pets | Per-Class Accuracy (%) | 87.3 | Diffusion Classifier (zero-shot) |
| Image Classification | Flowers-102 | Per-Class Accuracy (%) | 66.3 | Diffusion Classifier (zero-shot) |
| Image Classification | STL-10 | Accuracy (%) | 95.4 | Diffusion Classifier (zero-shot) |
| Image Classification | ObjectNet (ImageNet classes) | Top-1 Accuracy (%) | 43.4 | Diffusion Classifier (zero-shot) |
| Image Classification | ObjectNet (ImageNet classes) | Top-1 Accuracy (%) | 33.9 | Diffusion Classifier |
| Fine-Grained Image Classification | FGVC Aircraft | Accuracy (%) | 26.4 | Diffusion Classifier (zero-shot) |
| Zero-Shot Transfer Image Classification | ImageNet | Top-1 Accuracy (%) | 61.4 | Diffusion Classifier (zero-shot) |
| Zero-Shot Transfer Image Classification | Food-101 | Top-1 Accuracy (%) | 77.7 | Diffusion Classifier (zero-shot) |