Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Simple, Effective and General: A New Backbone for Cross-view Image Geo-localization

Yingying Zhu, Hongji Yang, Yuxin Lu, Qiang Huang

2023-02-03 · geo-localization · Visual Place Recognition · Image-Based Localization · Retrieval · Image Retrieval

Abstract

In this work, we address an important but less explored problem: designing a simple yet effective backbone specifically for the cross-view geo-localization task. Existing methods for cross-view geo-localization are frequently characterized by 1) complicated methodologies, 2) GPU-consuming computations, and 3) a stringent assumption that aerial and ground images are center-aligned or orientation-aligned. To address these three challenges for cross-view image matching, we propose a new backbone network, named Simple Attention-based Image Geo-localization network (SAIG). The proposed SAIG effectively represents long-range interactions among patches as well as cross-view correspondence with multi-head self-attention layers. The "narrow-deep" architecture of our SAIG improves feature richness without degradation in performance, while its shallow and effective convolutional stem preserves locality, eliminating the loss of patchify boundary information. Our SAIG achieves state-of-the-art results on cross-view geo-localization while being far simpler than previous works. Furthermore, with only 15.9% of the model parameters and half of the output dimension compared to the state-of-the-art, the SAIG adapts well across multiple cross-view datasets without employing any well-designed feature aggregation modules or feature alignment algorithms. In addition, our SAIG attains competitive scores on image retrieval benchmarks, further demonstrating its generalizability. As a backbone network, our SAIG is both easy to follow and computationally lightweight, which is valuable in practical scenarios. Moreover, we propose a simple Spatial-Mixed feature aggregation moDule (SMD) that can mix and project spatial information into a low-dimensional space to generate feature descriptors... (The code is available at https://github.com/yanghongji2007/SAIG)

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Localization | CVUSA | Recall@1 | 96.34 | SAIG-D |
| Object Localization | CVUSA | Recall@10 | 99.5 | SAIG-D |
| Object Localization | CVUSA | Recall@5 | 99.1 | SAIG-D |
| Object Localization | CVUSA | Recall@top1% | 99.86 | SAIG-D |
| Object Localization | CVACT | Recall@1 | 89.21 | SAIG-D |
| Object Localization | CVACT | Recall@1 (%) | 98.74 | SAIG-D |
| Object Localization | CVACT | Recall@10 | 97.04 | SAIG-D |
| Object Localization | CVACT | Recall@5 | 96.07 | SAIG-D |
| Object Localization | VIGOR Cross Area | Hit Rate | 36.71 | SAIG-D |
| Object Localization | VIGOR Cross Area | Recall@1 | 33.05 | SAIG-D |
| Object Localization | VIGOR Cross Area | Recall@1% | 94.64 | SAIG-D |
| Object Localization | VIGOR Cross Area | Recall@5 | 55.94 | SAIG-D |
| Object Localization | VIGOR Same Area | Hit Rate | 74.11 | SAIG-D |
| Object Localization | VIGOR Same Area | Recall@1 | 65.23 | SAIG-D |
| Object Localization | VIGOR Same Area | Recall@1% | 99.68 | SAIG-D |
| Object Localization | VIGOR Same Area | Recall@5 | 88.08 | SAIG-D |
| Visual Place Recognition | CV-Cities | Recall@1 | 42.21 | SAIG-D |
| Visual Place Recognition | CV-Cities | Recall@5 | 68.73 | SAIG-D |
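
For readers unfamiliar with the Recall@K figures reported above, the following is a minimal sketch of how such a metric is commonly computed for cross-view retrieval. It is not the authors' evaluation code. The function name `recall_at_k`, the use of cosine similarity, and the convention that "1%" means the top `floor(R / 100)` references (at least one) are all assumptions made for illustration; this page does not specify the exact protocol, and the VIGOR Hit Rate metric is not covered here.

```python
import numpy as np


def recall_at_k(query_emb, ref_emb, gt_index, ks=(1, 5, 10)):
    """Hypothetical helper: Recall@K (in percent) for retrieval.

    query_emb: (Q, D) array of query (e.g. ground-view) descriptors.
    ref_emb:   (R, D) array of reference (e.g. aerial-view) descriptors.
    gt_index:  (Q,) integer array; gt_index[i] is the matching reference row.
    """
    # L2-normalize so the dot product equals cosine similarity (assumed metric).
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = q @ r.T  # (Q, R) similarity matrix

    # Rank of the true match = number of references scoring strictly higher.
    gt_sim = sim[np.arange(q.shape[0]), gt_index]
    rank = (sim > gt_sim[:, None]).sum(axis=1)

    results = {f"Recall@{k}": 100.0 * float(np.mean(rank < k)) for k in ks}

    # Assumed convention: top 1% means the top floor(R / 100) references, min 1.
    k_one_percent = max(1, r.shape[0] // 100)
    results["Recall@1%"] = 100.0 * float(np.mean(rank < k_one_percent))
    return results


if __name__ == "__main__":
    # Toy data: 4 references, 3 queries; the second query ranks its match second.
    refs = np.eye(4)
    queries = np.array([[1.0, 0.0, 0.0, 0.0],
                        [0.6, 0.8, 0.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
    print(recall_at_k(queries, refs, np.array([0, 0, 2]), ks=(1, 2)))
```

With a gallery this small, the 1% cutoff collapses to a single reference, so Recall@1% equals Recall@1 in the toy example; on galleries with tens of thousands of references the 1% cutoff covers hundreds of candidates, which is why Recall@1% values in the table sit well above Recall@1.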

Related Papers

Visual Place Recognition for Large-Scale UAV Applications (2025-07-20)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)