Issar Tzachor, Boaz Lerner, Matan Levy, Michael Green, Tal Berkovitz Shalev, Gavriel Habib, Dvir Samuel, Noam Korngut Zailer, Or Shimshi, Nir Darshan, Rami Ben-Ari
The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent studies in VPR have highlighted the significant advantage of employing pre-trained foundation models like DINOv2 for the VPR task. However, these models are often deemed inadequate for VPR without further fine-tuning on VPR-specific data. In this paper, we present an effective approach to harness the potential of a foundation model for VPR. We show that features extracted from self-attention layers can act as a powerful re-ranker for VPR, even in a zero-shot setting. Our method not only outperforms previous zero-shot approaches but also achieves results competitive with several supervised methods. We then show that a single-stage approach utilizing internal ViT layers for pooling can produce global features that achieve state-of-the-art performance, with impressive feature compactness down to 128D. Moreover, integrating our local foundation features for re-ranking further widens this performance gap. Our method also demonstrates exceptional robustness and generalization, setting new state-of-the-art performance while handling challenging conditions such as occlusion, day-night transitions, and seasonal variations.
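
The retrieve-then-re-rank pipeline described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how local features are matched, so this example scores candidates by counting mutual nearest neighbors, and the function names (`cosine`, `mutual_nn_count`, `retrieve_and_rerank`) and the toy 2D vectors are hypothetical stand-ins for real ViT features. It is not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mutual_nn_count(query_feats, cand_feats):
    """Count local-feature pairs that are each other's nearest neighbor.

    Assumption: mutual-nearest-neighbor counting is one common matching
    heuristic; the abstract does not specify the matching rule.
    """
    sims = [[cosine(q, c) for c in cand_feats] for q in query_feats]
    n_q, n_c = len(query_feats), len(cand_feats)
    best_for_q = [max(range(n_c), key=lambda j: sims[i][j]) for i in range(n_q)]
    best_for_c = [max(range(n_q), key=lambda i: sims[i][j]) for j in range(n_c)]
    return sum(1 for i, j in enumerate(best_for_q) if best_for_c[j] == i)

def retrieve_and_rerank(q_global, q_local, db_global, db_local, top_k=2):
    """Stage 1: shortlist by global similarity. Stage 2: re-rank by local matches."""
    order = sorted(range(len(db_global)),
                   key=lambda i: cosine(q_global, db_global[i]),
                   reverse=True)
    shortlist = order[:top_k]
    # Python's sort is stable, so ties in match count keep the stage-1 order.
    return sorted(shortlist,
                  key=lambda i: mutual_nn_count(q_local, db_local[i]),
                  reverse=True)

# Toy data: database image 0 is closest globally, image 1 matches better locally.
db_global = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
db_local = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
q_global = [1.0, 0.0]
q_local = [[1.0, 0.0], [0.0, 1.0]]

ranking = retrieve_and_rerank(q_global, q_local, db_global, db_local, top_k=2)
print(ranking)
```

In this toy case the global stage alone would rank image 0 first, while the local re-ranking stage promotes image 1 because both of its local features have a mutual match in the query. Image 2 never reaches the second stage because it falls outside the top-2 shortlist.
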
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Place Recognition | AmsterTime | Recall@1 | 65.5 | EffoVPR |
| Visual Place Recognition | Nordland | Recall@1 | 95 | EffoVPR |
| Visual Place Recognition | Nordland | Recall@5 | 98.6 | EffoVPR |
| Visual Place Recognition | San Francisco Landmark Dataset | Recall@1 | 93 | EffoVPR |
| Visual Place Recognition | SF-XL test v1 | Recall@1 | 95.5 | EffoVPR |
| Visual Place Recognition | SF-XL test v1 | Recall@10 | 98.1 | EffoVPR |
| Visual Place Recognition | St Lucia | Recall@1 | 100 | EffoVPR |
| Visual Place Recognition | St Lucia | Recall@5 | 100 | EffoVPR |
| Visual Place Recognition | Pittsburgh-30k-test | Recall@1 | 93.9 | EffoVPR |
| Visual Place Recognition | Pittsburgh-30k-test | Recall@5 | 97.4 | EffoVPR |
| Visual Place Recognition | Tokyo247 | Recall@1 | 98.7 | EffoVPR |
| Visual Place Recognition | Tokyo247 | Recall@5 | 98.7 | EffoVPR |
| Visual Place Recognition | Tokyo247 | Recall@10 | 98.7 | EffoVPR |
| Visual Place Recognition | SF-XL test v2 | Recall@1 | 94.5 | EffoVPR |
| Visual Place Recognition | SF-XL test v2 | Recall@5 | 98.2 | EffoVPR |
| Visual Place Recognition | SF-XL test v2 | Recall@10 | 97.8 | EffoVPR |
| Visual Place Recognition | Mapillary val | Recall@1 | 92.8 | EffoVPR |
| Visual Place Recognition | Mapillary val | Recall@5 | 97.2 | EffoVPR |
| Visual Place Recognition | Mapillary val | Recall@10 | 97.4 | EffoVPR |
| Visual Place Recognition | Mapillary test | Recall@1 | 79 | EffoVPR |
| Visual Place Recognition | Mapillary test | Recall@5 | 89 | EffoVPR |
| Visual Place Recognition | Mapillary test | Recall@10 | 91.6 | EffoVPR |
| Visual Place Recognition | Eynsham | Recall@1 | 91 | EffoVPR |
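
The Recall@K values in the table can be read with the following sketch in mind. It assumes the definition commonly used in VPR benchmarks: a query counts as correct at K if at least one of its top-K retrieved database images is a true match. The function name `recall_at_k`, the toy rankings, and the ground-truth sets are hypothetical; how true matches are determined (for example, a distance threshold) depends on the dataset and is not stated in this document.

```python
def recall_at_k(ranked_ids, positives, ks=(1, 5, 10)):
    """Percentage of queries with at least one true match in the top K.

    ranked_ids: one list of database ids per query, best match first.
    positives:  one set of correct database ids per query.
    """
    hits = {k: 0 for k in ks}
    for preds, pos in zip(ranked_ids, positives):
        for k in ks:
            if any(p in pos for p in preds[:k]):
                hits[k] += 1
    n = len(ranked_ids)
    return {k: 100.0 * hits[k] / n for k in ks}

# Toy example with four queries and three retrieved ids each.
ranked = [[3, 1, 2], [0, 2, 1], [5, 4, 3], [9, 8, 7]]
truth = [{3}, {1}, {7}, {9}]

scores = recall_at_k(ranked, truth, ks=(1, 2, 3))
print(scores)
```

Under this definition, Recall@K computed on a fixed set of rankings cannot decrease as K grows, since the top-K list always contains the top-(K-1) list.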