Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models

Pengyue Jia, Yiding Liu, Xiaopeng Li, Yuhao Wang, Yantong Du, Xiao Han, Xuetao Wei, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao

2024-05-23 · Retrieval · Photo geolocation estimation · RAG

Abstract

Worldwide geolocalization aims to locate the precise location at the coordinate level of photos taken anywhere on the Earth. It is very challenging due to 1) the difficulty of capturing subtle location-aware visual semantics, and 2) the heterogeneous geographical distribution of image data. As a result, existing studies have clear limitations when scaled to a worldwide context. They may easily confuse distant images with similar visual contents, or cannot adapt to various locations worldwide with different amounts of relevant data. To resolve these limitations, we propose G3, a novel framework based on Retrieval-Augmented Generation (RAG). In particular, G3 consists of three steps, i.e., Geo-alignment, Geo-diversification, and Geo-verification to optimize both retrieval and generation phases of worldwide geolocalization. During Geo-alignment, our solution jointly learns expressive multi-modal representations for images, GPS and textual descriptions, which allows us to capture location-aware semantics for retrieving nearby images for a given query. During Geo-diversification, we leverage a prompt ensembling method that is robust to inconsistent retrieval performance for different image queries. Finally, we combine both retrieved and generated GPS candidates in Geo-verification for location prediction. Experiments on two well-established datasets IM2GPS3k and YFCC4k verify the superiority of G3 compared to other state-of-the-art methods. Our code and data are available online for reproduction.
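
The final step described above, pooling retrieved and generated GPS candidates and then picking one, can be illustrated with a minimal sketch. The abstract does not say how Geo-verification scores candidates, so everything here is an assumption made for illustration: the cosine-similarity scoring rule, the function names `verify` and `toy_gps_embedding`, and the toy embedding itself, which stands in for the learned image and GPS representations from Geo-alignment.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_gps_embedding(lat, lon):
    """Hypothetical stand-in for a learned GPS encoder.

    Maps (lat, lon) in degrees to a unit vector on the sphere, so cosine
    similarity decreases as great-circle distance grows.
    """
    la, lo = math.radians(lat), math.radians(lon)
    return (math.cos(la) * math.cos(lo), math.cos(la) * math.sin(lo), math.sin(la))


def verify(query_embedding, retrieved, generated, embed_gps=toy_gps_embedding):
    """Pool both candidate lists and return the best-scoring (lat, lon).

    Assumed scoring rule: cosine similarity between the query embedding
    and each candidate's GPS embedding. Not taken from the source.
    """
    # Merge the two lists, dropping duplicates while preserving order.
    candidates = list(dict.fromkeys(retrieved + generated))
    if not candidates:
        raise ValueError("no GPS candidates to verify")
    return max(candidates, key=lambda c: cosine(query_embedding, embed_gps(*c)))


# Usage: a query whose (toy) embedding corresponds to central Paris.
query = toy_gps_embedding(48.85, 2.35)
retrieved = [(40.71, -74.00), (51.50, -0.12)]   # e.g. from nearest-neighbour retrieval
generated = [(48.86, 2.34), (35.68, 139.69)]    # e.g. parsed from model outputs
best = verify(query, retrieved, generated)
```

In this toy setup the selected candidate is the one geographically closest to the query; with learned encoders the score would instead reflect visual and semantic agreement.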

Results

Task                  Dataset   Metric                     Value  Model
Image Classification  Im2GPS3k  City level (25 km)         40.94  G3
Image Classification  Im2GPS3k  Continent level (2500 km)  84.68  G3
Image Classification  Im2GPS3k  Country level (750 km)     71.24  G3
Image Classification  Im2GPS3k  Region level (200 km)      55.56  G3
Image Classification  Im2GPS3k  Street level (1 km)        16.65  G3
Image Classification  YFCC4k    City (25 km)               35.89  G3
Image Classification  YFCC4k    Continent (2500 km)        78.15  G3
Image Classification  YFCC4k    Country (750 km)           64.26  G3
Image Classification  YFCC4k    Region (200 km)            46.98  G3
Image Classification  YFCC4k    Street (1 km)              23.99  G3
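
The metric names above pair each level with a distance threshold (1, 25, 200, 750, and 2500 km). This page does not state how the values are computed; the sketch below assumes the common convention that each value is the percentage of test images whose predicted coordinates fall within the threshold of the ground truth, measured by great-circle (haversine) distance. The function names and the mean Earth radius of 6371 km are choices made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

# Thresholds taken from the metric names in the results table.
THRESHOLDS_KM = {
    "street": 1,
    "city": 25,
    "region": 200,
    "country": 750,
    "continent": 2500,
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_thresholds(predictions, ground_truth, thresholds=THRESHOLDS_KM):
    """Percentage of predictions within each distance threshold.

    predictions and ground_truth are equal-length lists of (lat, lon) pairs.
    """
    if len(predictions) != len(ground_truth) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    distances = [haversine_km(*p, *g) for p, g in zip(predictions, ground_truth)]
    return {
        name: 100.0 * sum(d <= km for d in distances) / len(distances)
        for name, km in thresholds.items()
    }


# Usage: one exact hit (Paris) and one prediction in London for a Paris photo.
preds = [(48.8566, 2.3522), (51.5074, -0.1278)]
truth = [(48.8566, 2.3522), (48.8566, 2.3522)]
scores = accuracy_at_thresholds(preds, truth)
```

Because the thresholds are nested, the scores can only stay equal or rise from street to continent level, which matches the pattern in the table.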

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)