Lukas Haas, Silas Alberti, Michal Skreta
Image geolocalization is the challenging task of predicting the geographic coordinates of origin for a given photo. It is an unsolved problem relying on the ability to combine visual clues with general knowledge about the world to make accurate predictions across geographies. We present $\href{https://huggingface.co/geolocal/StreetCLIP}{\text{StreetCLIP}}$, a robust, publicly available foundation model not only achieving state-of-the-art performance on multiple open-domain image geolocalization benchmarks but also doing so in a zero-shot setting, outperforming supervised models trained on more than 4 million images. Our method introduces a meta-learning approach for generalized zero-shot learning by pretraining CLIP from synthetic captions, grounding CLIP in a domain of choice. We show that our method effectively transfers CLIP's generalized zero-shot capabilities to the domain of image geolocalization, improving in-domain generalized zero-shot performance without finetuning StreetCLIP on a fixed set of classes.
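The zero-shot step described above can be illustrated with a minimal sketch: candidate locations are turned into text captions, and the image is assigned to the caption whose embedding is most similar to the image embedding. The caption template, the helper names, and the toy three-dimensional embeddings below are all assumptions made for illustration; in practice the embeddings would come from the model's image and text encoders, and the abstract does not specify the caption wording.

```python
import math

def make_captions(labels, template="A street-level photo taken in {}."):
    # Hypothetical caption template; the abstract does not give the wording.
    return [template.format(label) for label in labels]

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_emb, caption_embs, temperature=0.01):
    # Softmax over temperature-scaled cosine similarities.
    # The temperature value is an assumption, not taken from the paper.
    logits = [cosine(image_emb, c) / temperature for c in caption_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["France", "Japan", "Brazil"]
captions = make_captions(labels)

# Toy stand-ins for encoder outputs (assumed, not real model embeddings).
caption_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.2, 0.9, 0.1]

probs = zero_shot_probs(image_emb, caption_embs)
prediction = labels[max(range(len(labels)), key=probs.__getitem__)]
print(prediction)
```

Because the label set is only used to build captions at inference time, it can be swapped for any other list of countries, regions, or cities without retraining, which is the property the abstract refers to as not finetuning on a fixed set of classes.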
| Task | Dataset | Metric | Accuracy (%) | Model |
|---|---|---|---|---|
| Image Classification | Im2GPS3k | City level (25 km) | 22.4 | StreetCLIP (Zero-Shot) |
| Image Classification | Im2GPS3k | Continent level (2500 km) | 80.4 | StreetCLIP (Zero-Shot) |
| Image Classification | Im2GPS3k | Country level (750 km) | 61.3 | StreetCLIP (Zero-Shot) |
| Image Classification | Im2GPS3k | Region level (200 km) | 37.4 | StreetCLIP (Zero-Shot) |
| Image Classification | Im2GPS | City level (25 km) | 28.3 | StreetCLIP (Zero-Shot) |
| Image Classification | Im2GPS | Continent level (2500 km) | 88.2 | StreetCLIP (Zero-Shot) |
| Image Classification | Im2GPS | Country level (750 km) | 74.7 | StreetCLIP (Zero-Shot) |
| Image Classification | Im2GPS | Region level (200 km) | 45.1 | StreetCLIP (Zero-Shot) |
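
The metrics in the table are distance-threshold accuracies. A minimal sketch of how such a score could be computed follows, assuming the usual interpretation: the percentage of images whose predicted coordinates fall within the given great-circle distance of the true location. The function names and the sample coordinates are illustrative assumptions; only the four thresholds are taken from the table.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Distance thresholds as listed in the results table.
THRESHOLDS_KM = {"city": 25, "region": 200, "country": 750, "continent": 2500}

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points in degrees.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def accuracy_at_thresholds(predicted, actual):
    # Percentage of predictions within each threshold of the ground truth.
    distances = [haversine_km(*p, *a) for p, a in zip(predicted, actual)]
    return {
        name: 100.0 * sum(d <= km for d in distances) / len(distances)
        for name, km in THRESHOLDS_KM.items()
    }

# Illustrative coordinates (assumed): one near-exact guess, one a few hundred km off.
predicted = [(48.8566, 2.3522), (35.0, 135.0)]
actual = [(48.8600, 2.3400), (35.6762, 139.6503)]

results = accuracy_at_thresholds(predicted, actual)
print(results)
```

In this toy case the first prediction lands within about a kilometre of the target and the second is roughly 400 km away, so only the first counts at the city and region levels while both count at the country and continent levels.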