Gaoshuang Huang, Yang Zhou, Luying Zhao, Wenjian Gan
Cross-view geo-localization (CVGL), which determines the geographic location of a ground image by matching it against and retrieving geo-tagged satellite images, is crucial in GNSS-constrained scenarios. However, the task faces significant challenges due to substantial viewpoint discrepancies, the complexity of localization scenarios, and the need for global localization. To address these issues, we propose a novel CVGL framework that integrates the vision foundation model DINOv2 with an advanced feature mixer. Our framework introduces the symmetric InfoNCE loss and incorporates near-neighbor sampling and dynamic similarity sampling strategies, significantly enhancing localization accuracy. Experimental results show that our framework surpasses existing methods across multiple public and self-built datasets. To further improve global-scale performance, we have developed CV-Cities, a novel dataset for global CVGL. CV-Cities includes 223,736 ground-satellite image pairs with geolocation data, spanning sixteen cities across six continents and covering a wide range of complex scenarios, providing a challenging benchmark for CVGL. The framework trained with CV-Cities demonstrates high localization accuracy in various test cities, highlighting its strong global applicability and generalization capabilities. Our datasets and code are available at https://github.com/GaoShuang98/CVCities.
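To make the symmetric InfoNCE loss mentioned in the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the temperature value of 0.1, and the assumption that row i of the ground embeddings pairs with row i of the satellite embeddings are illustrative choices. The loss treats every other item in the batch as a negative and averages the ground-to-satellite and satellite-to-ground cross-entropy terms.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(ground, satellite, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Assumption: ground[i] and satellite[i] form the positive pair;
    the temperature is a hypothetical default, not a value from the paper.
    """
    g = [l2_normalize(v) for v in ground]
    s = [l2_normalize(v) for v in satellite]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(gi, sj)) / temperature for sj in s]
        for gi in g
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the ground->satellite and satellite->ground directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    ground = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned pairs: ", symmetric_info_nce(ground, aligned))
    print("shuffled pairs:", symmetric_info_nce(ground, shuffled))
```

Correctly paired embeddings give a loss close to zero, while mismatched pairs are penalized heavily. The symmetry means both retrieval directions are penalized equally.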
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Localization | CVUSA | Recall@1 | 99.19 | CV-Cities |
| Object Localization | CVUSA | Recall@5 | 99.8 | CV-Cities |
| Object Localization | CVUSA | Recall@10 | 99.85 | CV-Cities |
| Object Localization | CVUSA | Recall@1% | 99.92 | CV-Cities |
| Object Localization | CVACT | Recall@1 | 92.59 | CV-Cities |
| Object Localization | CVACT | Recall@5 | 97.16 | CV-Cities |
| Object Localization | CVACT | Recall@10 | 97.82 | CV-Cities |
| Object Localization | CVACT | Recall@1% | 98.72 | CV-Cities |
| Object Localization | VIGOR Cross Area | Recall@1 | 64.61 | CV-Cities |
| Object Localization | VIGOR Cross Area | Recall@5 | 87.48 | CV-Cities |
| Object Localization | VIGOR Cross Area | Recall@10 | 91.2 | CV-Cities |
| Object Localization | VIGOR Cross Area | Recall@1% | 98.63 | CV-Cities |
| Object Localization | VIGOR Cross Area | Hit Rate | 75.97 | CV-Cities |
| Object Localization | VIGOR Same Area | Recall@1 | 78.27 | CV-Cities |
| Object Localization | VIGOR Same Area | Recall@5 | 96.1 | CV-Cities |
| Object Localization | VIGOR Same Area | Recall@10 | 97.52 | CV-Cities |
| Object Localization | VIGOR Same Area | Recall@1% | 99.67 | CV-Cities |
| Object Localization | VIGOR Same Area | Hit Rate | 90.76 | CV-Cities |
| Image Retrieval | University-1652 | AP | 95.01 | CV-Cities |
| Image Retrieval | University-1652 | Recall@1 | 97.43 | CV-Cities |
| Visual Place Recognition | CV-Cities | Recall@1 | 82.91 | CV-Cities |
| Visual Place Recognition | CV-Cities | Recall@5 | 90.14 | CV-Cities |
| Content-Based Image Retrieval | University-1652 | AP | 95.01 | CV-Cities |
| Content-Based Image Retrieval | University-1652 | Recall@1 | 97.43 | CV-Cities |
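
The Recall@K figures in the table follow the usual retrieval convention: a query counts as a hit if its true match appears among the K most similar gallery items. The sketch below illustrates that convention under simplifying assumptions that are not taken from the paper: each query has exactly one true match, stored at the same index in the gallery, and the helper names are hypothetical. Recall@1% uses the same rule with K set to 1% of the gallery size.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose true match ranks in the top k.

    similarity[i][j] is the similarity between query i and gallery item j.
    Assumption: the true match of query i is gallery item i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


def recall_at_one_percent(similarity):
    """Recall@1%: k is 1% of the gallery size, with a minimum of 1."""
    gallery_size = len(similarity[0])
    return recall_at_k(similarity, max(1, gallery_size // 100))


if __name__ == "__main__":
    sim = [
        [0.9, 0.1, 0.2],  # query 0: true match ranked first
        [0.8, 0.3, 0.5],  # query 1: true match ranked last
        [0.1, 0.2, 0.7],  # query 2: true match ranked first
    ]
    for k in (1, 2, 3):
        print(f"Recall@{k}: {recall_at_k(sim, k):.2f}")
```

Because the top-K sets are nested, Recall@K can only stay level or rise as K grows, which is a quick sanity check for reported results.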