G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models

Pengyue Jia, Yiding Liu, Xiaopeng Li, Yuhao Wang, Yantong Du, Xiao Han, Xuetao Wei, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao

2024-05-23 | Retrieval, Photo geolocation estimation, RAG


Abstract

Worldwide geolocalization aims to predict the precise, coordinate-level location of photos taken anywhere on Earth. The task is challenging because of 1) the difficulty of capturing subtle location-aware visual semantics, and 2) the heterogeneous geographical distribution of image data. As a result, existing studies have clear limitations when scaled to a worldwide context: they may confuse distant images that share similar visual content, or fail to adapt to locations worldwide that have different amounts of relevant data. To resolve these limitations, we propose G3, a novel framework based on Retrieval-Augmented Generation (RAG). G3 consists of three steps, Geo-alignment, Geo-diversification, and Geo-verification, which together optimize both the retrieval and generation phases of worldwide geolocalization. During Geo-alignment, our solution jointly learns expressive multi-modal representations for images, GPS coordinates, and textual descriptions, which allows us to capture location-aware semantics for retrieving images near a given query. During Geo-diversification, we leverage a prompt ensembling method that is robust to inconsistent retrieval performance across image queries. Finally, in Geo-verification, we combine the retrieved and generated GPS candidates to make the location prediction. Experiments on two well-established datasets, IM2GPS3k and YFCC4k, show that G3 outperforms other state-of-the-art methods. Our code and data are available online for reproduction.
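
The Geo-verification step described in the abstract can be illustrated with a minimal sketch. It assumes that verification means picking, from the pooled retrieved and generated GPS candidates, the one whose embedding is most similar to the query image embedding. The abstract does not give the scoring rule, so cosine similarity is an assumption here, and `toy_embed_gps`, `verify`, and the sample coordinates are hypothetical stand-ins rather than G3's learned encoders or official code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def toy_embed_gps(gps):
    # Hypothetical stand-in for a learned GPS encoder:
    # maps (lat, lon) in degrees to a unit vector on the sphere.
    lat, lon = map(math.radians, gps)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def verify(query_emb, candidates, embed_gps):
    """Return the candidate (lat, lon) whose embedding best matches the query."""
    return max(candidates, key=lambda gps: cosine(query_emb, embed_gps(gps)))

# Candidates from the two sources named in the abstract (values are made up).
retrieved = [(48.8584, 2.2945), (40.7128, -74.0060)]  # from nearby-image retrieval
generated = [(51.5074, -0.1278)]                      # from the multi-modality model

# Assumption: the image encoder places the query near Paris in the shared space.
query_emb = toy_embed_gps((48.86, 2.35))

best = verify(query_emb, retrieved + generated, toy_embed_gps)
print(best)
```

In this toy setup the shared space is just the unit sphere, so the highest-similarity candidate is also the geographically nearest one. A learned multi-modal space would not have that property in general.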

Results

Task	Dataset	Metric	Accuracy (%)	Model
Image Classification	Im2GPS3k	Street level (1 km)	16.65	G3
Image Classification	Im2GPS3k	City level (25 km)	40.94	G3
Image Classification	Im2GPS3k	Region level (200 km)	55.56	G3
Image Classification	Im2GPS3k	Country level (750 km)	71.24	G3
Image Classification	Im2GPS3k	Continent level (2500 km)	84.68	G3
Image Classification	YFCC4k	Street level (1 km)	23.99	G3
Image Classification	YFCC4k	City level (25 km)	35.89	G3
Image Classification	YFCC4k	Region level (200 km)	46.98	G3
Image Classification	YFCC4k	Country level (750 km)	64.26	G3
Image Classification	YFCC4k	Continent level (2500 km)	78.15	G3
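
The metric names in the table pair each level with a distance threshold, which suggests that each value is the percentage of test images whose predicted location falls within that distance of the ground truth. A minimal sketch of such a computation follows. It assumes great-circle (haversine) distance on a sphere of radius 6371 km and an inclusive threshold. The function names and sample coordinates are illustrative and are not taken from the official code.

```python
import math

# Distance thresholds (km) as listed in the results table.
THRESHOLDS_KM = {"street": 1, "city": 25, "region": 200,
                 "country": 750, "continent": 2500}

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def accuracy_at_thresholds(preds, truths):
    """Percentage of predictions within each threshold of the ground truth."""
    dists = [haversine_km(p, t) for p, t in zip(preds, truths)]
    return {name: 100.0 * sum(d <= km for d in dists) / len(dists)
            for name, km in THRESHOLDS_KM.items()}

# Made-up example: one prediction a few km off, one on the wrong continent.
preds = [(48.8584, 2.2945), (40.7128, -74.0060)]
truths = [(48.8606, 2.3376), (51.5074, -0.1278)]
scores = accuracy_at_thresholds(preds, truths)
print(scores)
```

Because the thresholds are nested, the accuracy can only stay the same or grow from street level to continent level, which matches the pattern of values reported for both datasets.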
