Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning Deep Representations of Fine-grained Visual Descriptions

Scott Reed, Zeynep Akata, Bernt Schiele, Honglak Lee

2016-05-17 · CVPR 2016
Tasks: Attribute, Retrieval, Zero-Shot Learning, Image Retrieval

Abstract

State-of-the-art methods for zero-shot visual recognition formulate learning as a joint embedding problem of images and side information. In these formulations the current best complement to visual features are attributes: manually encoded vectors describing shared characteristics among categories. Despite good performance, attributes have limitations: (1) finer-grained recognition requires commensurately more attributes, and (2) attributes do not provide a natural language interface. We propose to overcome these limitations by training neural language models from scratch; i.e. without pre-training and only consuming words and characters. Our proposed models train end-to-end to align with the fine-grained and category-specific content of images. Natural language provides a flexible and compact way of encoding only the salient visual aspects for distinguishing categories. By training on raw text, our model can do inference on raw text as well, providing humans a familiar mode both for annotation and retrieval. Our model achieves strong performance on zero-shot text-based image retrieval and significantly outperforms the attribute-based state-of-the-art for zero-shot classification on the Caltech UCSD Birds 200-2011 dataset.
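The joint embedding formulation described above can be sketched concretely. The minimal NumPy example below is an illustrative reconstruction, not the paper's implementation: it assumes pre-computed image and text embeddings (the paper learns these end-to-end with CNN/RNN encoders) and shows the symmetric structured hinge objective behind DS-SJE, plus zero-shot classification by picking the unseen class whose text embedding is most compatible with an image.

```python
import numpy as np

def compatibility(image_emb, text_emb):
    """Inner-product compatibility F(v, t) between image and text embeddings."""
    return image_emb @ text_emb.T

def dssje_loss(image_emb, text_emb, labels):
    """Sketch of the symmetric structured joint embedding objective.

    Applies a structured hinge loss in both directions (image -> text and
    text -> image), so that each image scores highest with its paired text
    and vice versa. Rows of image_emb and text_emb are assumed paired.
    """
    scores = compatibility(image_emb, text_emb)   # (n, n) score matrix
    n = len(labels)
    correct = scores[np.arange(n), np.arange(n)]  # scores of true pairs
    # 0/1 misclassification margin: penalize only cross-class confusions
    delta = (labels[:, None] != labels[None, :]).astype(float)
    img_to_txt = np.maximum(0.0, delta + scores - correct[:, None]).max(axis=1)
    txt_to_img = np.maximum(0.0, delta + scores.T - correct[:, None]).max(axis=1)
    return (img_to_txt + txt_to_img).mean()

def zero_shot_classify(image_emb, class_text_emb):
    """Assign each image the unseen class with the most compatible text."""
    return compatibility(image_emb, class_text_emb).argmax(axis=1)
```

At test time no images of the unseen classes are needed: `zero_shot_classify` ranks class descriptions (encoded from raw words or characters) by compatibility, which is what enables the text-based retrieval and zero-shot classification results reported below.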

Results

Task                          | Dataset                 | Metric   | Value | Model
------------------------------|-------------------------|----------|-------|--------------------------------
Image Classification          | Flowers-102 (0-shot)    | AP50     | 59.6  | Word CNN-RNN (DS-SJE Embedding)
Image Classification          | CUB 200 50-way (0-shot) | Accuracy | 50.9  | DA-SJE, Reed et al. (2016)
Image Classification          | CUB 200 50-way (0-shot) | Accuracy | 50.4  | DS-SJE, Reed et al. (2016)
Image Classification          | CUB-200-2011 (0-shot)   | AP50     | 48.7  | Word CNN-RNN (DS-SJE Embedding)
Few-Shot Image Classification | Flowers-102 (0-shot)    | AP50     | 59.6  | Word CNN-RNN (DS-SJE Embedding)
Few-Shot Image Classification | CUB 200 50-way (0-shot) | Accuracy | 50.9  | DA-SJE, Reed et al. (2016)
Few-Shot Image Classification | CUB 200 50-way (0-shot) | Accuracy | 50.4  | DS-SJE, Reed et al. (2016)
Few-Shot Image Classification | CUB-200-2011 (0-shot)   | AP50     | 48.7  | Word CNN-RNN (DS-SJE Embedding)

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
- FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
- Non-Adaptive Adversarial Face Generation (2025-07-16)