
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingxiang Hu, Huimei He, Jian Wang, Jingdong Chen, Ming Yang, Yongjun Zhang, Yansheng Li

2023-12-15 · CVPR 2024
Tasks: Image Classification, Open Vocabulary Semantic Segmentation, Contrastive Learning, Zero-shot Classification (unified classes), Temporal Sequences, Visual Question Answering

Abstract

Prior studies on Remote Sensing Foundation Model (RSFM) reveal immense potential towards a generic model for Earth Observation. Nevertheless, these works primarily focus on a single modality without temporal and geo-context modeling, hampering their capabilities for diverse tasks. In this study, we present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained by our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations by the geo-context clue, we introduce Geo-Context Prototype Learning to learn region-aware prototypes upon RSI's multi-modal spatiotemporal features. To our best knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities on a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios. Specifically, it outperforms the latest models such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and 3.61% on average respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications.

Results

Task                                   Dataset        Metric         Value  Model
Visual Question Answering (VQA)        SIRI-WHU       Acc. (test)    74.79  SkySense-O
Visual Question Answering (VQA)        RSVQA-HR       zero-shot Acc  78.09  SkySense-O
Visual Question Answering (VQA)        AID-VQA        Acc. (test)    94.1   SkySense-O
Image Classification                   RESISC45       zero-shot Acc  83.28  SkySense-O
Open Vocabulary Semantic Segmentation  SIOR           mIoU           30.89  SkySense-O
Open Vocabulary Semantic Segmentation  SOTA           mIoU           32.12  SkySense-O
Open Vocabulary Semantic Segmentation  FAST           mIoU           8.3    SkySense-O
Open Vocabulary Semantic Segmentation  ISPRS Potsdam  mIoU           54.1   SkySense-O
Open Vocabulary Semantic Segmentation  iSAID          mIoU           -43.9  SkySense-O
