Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li, Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra, David Z. Pan

Published: 2021-11-01 · CVPR 2022
Tasks: Image Classification · Representation Learning · Segmentation · Semantic Segmentation

Paper · PDF · Code (official)

Abstract

Vision Transformers (ViTs) have emerged with superior performance on computer vision tasks compared to convolutional neural network (CNN)-based models. However, ViTs are mainly designed for image classification and generate single-scale, low-resolution representations, which makes dense prediction tasks such as semantic segmentation challenging for ViTs. Therefore, we propose HRViT, which enhances ViTs to learn semantically rich and spatially precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs. We balance the performance and efficiency of HRViT through various branch-block co-optimization techniques. Specifically, we explore heterogeneous branch designs, reduce the redundancy in linear layers, and augment the attention block with enhanced expressiveness. These techniques push the Pareto frontier of performance and efficiency on semantic segmentation to a new level, as our evaluation results on ADE20K and Cityscapes show. HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes, surpassing state-of-the-art MiT and CSWin backbones with an average of +1.78 mIoU improvement, 28% parameter saving, and 21% FLOPs reduction, demonstrating the potential of HRViT as a strong vision backbone for semantic segmentation.
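The core idea of a high-resolution multi-branch architecture is that a spatially precise high-resolution branch and a semantically rich low-resolution branch run in parallel and repeatedly exchange features. A heavily simplified NumPy sketch of such a cross-resolution fusion step (the helpers `downsample`, `upsample`, and `fuse_branches` are hypothetical illustrations, not the paper's actual HRViT modules, which use attention blocks and learned projections):

```python
import numpy as np

def downsample(x, factor=2):
    """Average-pool an (H, W, C) feature map by `factor` (stand-in for a strided conv)."""
    H, W, C = x.shape
    return x.reshape(H // factor, factor, W // factor, factor, C).mean(axis=(1, 3))

def upsample(x, factor=2):
    """Nearest-neighbour upsample an (H, W, C) feature map by `factor`."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_branches(hi, lo):
    """Cross-resolution fusion: each branch absorbs the other branch's features.

    hi: high-resolution branch, shape (H, W, C)
    lo: low-resolution branch,  shape (H/2, W/2, C)
    Returns the fused (hi, lo) pair; shapes are unchanged.
    """
    new_hi = hi + upsample(lo)    # low-res semantics injected into the high-res branch
    new_lo = lo + downsample(hi)  # high-res spatial detail injected into the low-res branch
    return new_hi, new_lo

hi = np.random.rand(8, 8, 4)  # spatially precise branch
lo = np.random.rand(4, 4, 4)  # semantically rich branch
hi2, lo2 = fuse_branches(hi, lo)
print(hi2.shape, lo2.shape)   # (8, 8, 4) (4, 4, 4)
```

After fusion, each branch keeps its own resolution, so dense-prediction heads can read off the high-resolution stream directly instead of upsampling a single low-resolution output, which is the property that single-scale ViT backbones lack.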

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 67.9 | HRViT-b3 (SegFormer, SS) |
| Semantic Segmentation | ADE20K | Params (M) | 28.7 | HRViT-b3 (SegFormer, SS) |
| Semantic Segmentation | ADE20K | Validation mIoU | 50.2 | HRViT-b3 (SegFormer, SS) |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 28 | HRViT-b2 (SegFormer, SS) |
| Semantic Segmentation | ADE20K | Params (M) | 20.8 | HRViT-b2 (SegFormer, SS) |
| Semantic Segmentation | ADE20K | Validation mIoU | 48.76 | HRViT-b2 (SegFormer, SS) |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 14.6 | HRViT-b1 (SegFormer, SS) |
| Semantic Segmentation | ADE20K | Params (M) | 8.2 | HRViT-b1 (SegFormer, SS) |
| Semantic Segmentation | ADE20K | Validation mIoU | 45.88 | HRViT-b1 (SegFormer, SS) |

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)