Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery

Libo Wang, Rui Li, Ce Zhang, Shenghui Fang, Chenxi Duan, Xiaoliang Meng, Peter M. Atkinson

2021-09-18 · Image Classification · Scene Segmentation · Segmentation · Semantic Segmentation · Change Detection · Object Detection

Paper · PDF · Code (official)

Abstract

Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning technologies, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNN adopts hierarchical feature representation, demonstrating strong capabilities for local information extraction. However, the local property of the convolution layer limits the network from capturing the global context. Recently, as a hot topic in the domain of computer vision, Transformer has demonstrated its great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global-local attention mechanism to model both global and local information in the decoder. Extensive experiments reveal that our method not only runs faster but also produces higher accuracy compared with state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while the inference speed can achieve up to 322.4 FPS with a 512x512 input on a single NVIDIA GTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves the state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.
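
The decoder's central idea is a global-local attention block: a windowed self-attention branch captures context while a convolutional branch preserves local detail, and the two are fused. A minimal NumPy sketch of that idea follows; it is an illustration only, not the paper's implementation — the learned Q/K/V projections are replaced with identity maps and the learned convolutional branch with depth-wise average pooling.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, window):
    """Single-head self-attention inside non-overlapping windows (global branch).

    x: (H, W, C) feature map; H and W are assumed divisible by `window`.
    Identity projections are used for brevity; real blocks learn Wq, Wk, Wv.
    """
    H, W, C = x.shape
    out = np.empty_like(x)
    for i in range(0, H, window):
        for j in range(0, W, window):
            tokens = x[i:i + window, j:j + window].reshape(-1, C)
            scores = tokens @ tokens.T / np.sqrt(C)          # (w*w, w*w)
            attended = softmax(scores) @ tokens              # (w*w, C)
            out[i:i + window, j:j + window] = attended.reshape(window, window, C)
    return out

def local_branch(x, k=3):
    """Depth-wise k x k average pooling as a stand-in for the learned local convolutions."""
    H, W, C = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for di in range(k):
        for dj in range(k):
            out += xp[di:di + H, dj:dj + W]
    return out / (k * k)

def global_local_block(x, window=4):
    """Fuse the global (windowed attention) and local (convolutional) branches by summation."""
    return window_attention(x, window) + local_branch(x)

x = np.random.rand(8, 8, 16).astype(np.float32)
y = global_local_block(x)
print(y.shape)  # (8, 8, 16): the block is shape-preserving
```

The shape-preserving property is what lets such a block be stacked at each decoder stage of a UNet-like architecture.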

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | US3D | mIoU | 74.77 | UNetFormer |
| Semantic Segmentation | LoveDA | Category mIoU | 52.4 | UNetFormer |
| Semantic Segmentation | Potsdam | mIoU | 85.18 | UNetFormer |
| Semantic Segmentation | ISPRS Vaihingen | Average F1 | 91.3 | FT-UNetFormer |
| Semantic Segmentation | ISPRS Vaihingen | Category mIoU | 84.1 | FT-UNetFormer |
| Semantic Segmentation | ISPRS Vaihingen | Overall Accuracy | 91.6 | FT-UNetFormer |
| Semantic Segmentation | ISPRS Vaihingen | Average F1 | 90.4 | UNetFormer |
| Semantic Segmentation | ISPRS Vaihingen | Category mIoU | 82.7 | UNetFormer |
| Semantic Segmentation | ISPRS Vaihingen | Overall Accuracy | 91 | UNetFormer |
| Semantic Segmentation | ISPRS Potsdam | Mean F1 | 93.3 | FT-UNetFormer |
| Semantic Segmentation | ISPRS Potsdam | Mean IoU | 87.5 | FT-UNetFormer |
| Semantic Segmentation | ISPRS Potsdam | Overall Accuracy | 92 | FT-UNetFormer |
| Semantic Segmentation | ISPRS Potsdam | Mean F1 | 92.8 | UNetFormer |
| Semantic Segmentation | ISPRS Potsdam | Mean IoU | 86.8 | UNetFormer |
| Semantic Segmentation | ISPRS Potsdam | Overall Accuracy | 91.3 | UNetFormer |
| Semantic Segmentation | Vaihingen | mIoU | 77.24 | UNetFormer |
| Semantic Segmentation | UAVid | Mean IoU | 67.8 | UNetFormer |
| Semantic Segmentation | UAVid | Category mIoU | 67.8 | UNetFormer |
| Scene Segmentation | UAVid | Category mIoU | 67.8 | UNetFormer |
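The results above mix three standard segmentation metrics: mean IoU, mean/average F1, and overall accuracy. All three can be read off a class confusion matrix. A minimal sketch follows; the function name and the toy label maps are illustrative, not from the paper's codebase.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Compute mIoU, mean F1, and overall accuracy from integer label maps.

    pred, gt: integer arrays of the same shape, values in [0, num_classes).
    """
    # Build the confusion matrix: rows = ground truth, columns = prediction.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        cm[g, p] += 1
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class c but wrong
    fn = cm.sum(axis=1) - tp          # class c missed
    iou = tp / np.maximum(tp + fp + fn, 1)          # per-class IoU
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)   # per-class F1
    oa = tp.sum() / cm.sum()                        # overall (pixel) accuracy
    return iou.mean(), f1.mean(), oa

# Toy 2x3 label maps with 3 classes.
gt   = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 2]])
miou, mf1, oa = segmentation_metrics(pred, gt, num_classes=3)
print(miou, mf1, oa)  # ~0.722, ~0.822, ~0.833
```

Note that "Category mIoU" on these leaderboards is the mean of per-class IoUs, which is exactly the `iou.mean()` term above.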

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)