Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, Bernt Schiele
Image classification has advanced significantly in recent years with the availability of large-scale image sets. However, fine-grained classification remains a major challenge due to the annotation cost of large numbers of fine-grained categories. This project shows that compelling classification performance can be achieved on such categories even without labeled training data. Given image and class embeddings, we learn a compatibility function such that matching embeddings are assigned a higher score than mismatching ones; zero-shot classification of an image proceeds by finding the label yielding the highest joint compatibility score. We use state-of-the-art image features and focus on different supervised attributes and unsupervised output embeddings either derived from hierarchies or learned from unlabeled text corpora. We establish a substantially improved state-of-the-art on the Animals with Attributes and Caltech-UCSD Birds datasets. Most encouragingly, we demonstrate that purely unsupervised output embeddings (learned from Wikipedia and improved with fine-grained text) achieve compelling results, even outperforming the previous supervised state-of-the-art. By combining different output embeddings, we further improve results.
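
The prediction rule described above (score an image embedding against each class embedding, then pick the highest-scoring label) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the form of the compatibility function, so the bilinear form `image_emb^T W class_emb` is an assumption, as are the function names, the identity matrix standing in for a learned `W`, and the toy embeddings.

```python
import numpy as np


def compatibility(image_emb, W, class_embs):
    """Score one image embedding against every class embedding.

    Assumed bilinear form (not stated in the abstract):
        score(x, y) = image_emb^T @ W @ class_emb_y

    image_emb:  shape (d_img,)
    W:          shape (d_img, d_cls), the learned compatibility matrix
    class_embs: shape (n_classes, d_cls), one output embedding per class
    Returns an array of shape (n_classes,).
    """
    return class_embs @ (W.T @ image_emb)


def zero_shot_predict(image_emb, W, class_embs):
    """Return the index of the class with the highest compatibility score."""
    return int(np.argmax(compatibility(image_emb, W, class_embs)))


# Toy example: three unseen classes described by 3-d output embeddings
# (hypothetical values; in practice these would be attributes, hierarchy
# embeddings, or text-derived embeddings).
W = np.eye(3)  # stand-in for a learned matrix
class_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.1, 0.9, 0.2])

scores = compatibility(image_emb, W, class_embs)
predicted = zero_shot_predict(image_emb, W, class_embs)
```

Because the classes enter only through their output embeddings, the same `W` can score classes that had no labeled training images, which is what makes the zero-shot setting possible.
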
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Classification | CUB 200 50-way (0-shot) | Accuracy | 50.1 | SJE Akata et al. (2015) |
| Few-Shot Image Classification | CUB 200 50-way (0-shot) | Accuracy | 50.1 | SJE Akata et al. (2015) |
| Zero-Shot Action Recognition | UCF101 | Top-1 Accuracy | 12 | SJE(Attribute) |
| Zero-Shot Action Recognition | UCF101 | Top-1 Accuracy | 9.9 | SJE(Word Embedding) |
| Zero-Shot Action Recognition | Kinetics | Top-1 Accuracy | 22.3 | SJE(Word Embedding) |
| Zero-Shot Action Recognition | Kinetics | Top-5 Accuracy | 48.2 | SJE(Word Embedding) |
| Zero-Shot Action Recognition | HMDB51 | Top-1 Accuracy | 13.3 | SJE(Word Embedding) |
| Zero-Shot Action Recognition | Olympics | Top-1 Accuracy | 47.5 | SJE(Attribute) |
| Zero-Shot Action Recognition | Olympics | Top-1 Accuracy | 28.6 | SJE(Word Embedding) |