Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective

Qishuai Wen, Chun-Guang Li

2024-11-05 · Segmentation · Semantic Segmentation

Abstract

State-of-the-art methods for Transformer-based semantic segmentation typically adopt Transformer decoders that extract additional embeddings from image embeddings via cross-attention, refine either or both types of embeddings via self-attention, and project image embeddings onto the additional embeddings via dot-product. Despite their remarkable success, these empirical designs still lack theoretical justification or interpretation, which hinders principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between Transformer decoders and Principal Component Analysis (PCA). From this perspective, we derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT), with the following interpretations: 1) the self-attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most of the information; 2) the cross-attention operator seeks a low-rank approximation of the refined image embeddings, which is expected to be a set of orthonormal bases of the principal subspace and to correspond to the predefined classes; 3) the dot-product operation yields compact representations of the image embeddings as segmentation masks. Experiments on the ADE20K dataset show that DEPICT consistently outperforms its black-box counterpart, Segmenter, and that it is lightweight and more robust.
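The third interpretation in the abstract (dot-product projection onto class embeddings yielding segmentation masks) can be illustrated with a toy sketch. This is not the authors' DEPICT implementation: the function names `project_onto_classes` and `masks_from_coefficients` and the toy embeddings are assumptions made for illustration, and the class embeddings are simply taken to be orthonormal, as the abstract expects of the principal-subspace bases.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_onto_classes(patch_embeddings, class_embeddings):
    """Return, for each patch, one dot-product coefficient per class embedding."""
    return [[dot(p, c) for c in class_embeddings] for p in patch_embeddings]


def masks_from_coefficients(coefficients):
    """Assign each patch to the class with the largest coefficient."""
    return [max(range(len(row)), key=row.__getitem__) for row in coefficients]


# Hypothetical toy data: 4 patches in a 3-D embedding space, 2 classes.
# The two class embeddings are orthonormal (an assumption of this sketch).
class_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
patch_embeddings = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.0],
    [0.0, 1.0, 0.2],
]

coefficients = project_onto_classes(patch_embeddings, class_embeddings)
labels = masks_from_coefficients(coefficients)
print(labels)
```

Each patch is reduced from three coordinates to two coefficients, one per class, which is the sense in which the projection gives a compact representation. A real decoder would operate on learned embeddings and typically upsample the patch-level scores to pixel resolution.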

Results

Task                      Dataset          Metric  Value  Model
Semantic Segmentation     Cityscapes val   mIoU    81     DEPICT-SA (ViT-L multi-scale)
Semantic Segmentation     Cityscapes val   mIoU    78.8   DEPICT-SA (ViT-L single-scale)
Semantic Segmentation     ADE20K val       mIoU    54.3   DEPICT-SA (ViT-L 640x640 multi-scale)
Semantic Segmentation     ADE20K val       mIoU    52.9   DEPICT-SA (ViT-L 640x640 single-scale)
Semantic Segmentation     PASCAL Context   mIoU    58.6   DEPICT-SA (ViT-L multi-scale)
Semantic Segmentation     PASCAL Context   mIoU    57.9   DEPICT-SA (ViT-L single-scale)
10-shot image generation  Cityscapes val   mIoU    81     DEPICT-SA (ViT-L multi-scale)
10-shot image generation  Cityscapes val   mIoU    78.8   DEPICT-SA (ViT-L single-scale)
10-shot image generation  ADE20K val       mIoU    54.3   DEPICT-SA (ViT-L 640x640 multi-scale)
10-shot image generation  ADE20K val       mIoU    52.9   DEPICT-SA (ViT-L 640x640 single-scale)
10-shot image generation  PASCAL Context   mIoU    58.6   DEPICT-SA (ViT-L multi-scale)
10-shot image generation  PASCAL Context   mIoU    57.9   DEPICT-SA (ViT-L single-scale)
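
The results above are reported as mIoU (mean intersection-over-union). As a rough illustration of what that metric measures, here is a minimal sketch over flat label lists. The function name `mean_iou` and the toy labels are assumptions for this example; benchmark protocols usually accumulate counts over the whole validation set and handle ignored labels, which this sketch leaves out.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth equal class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth equals class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label lists
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical toy labels for 6 pixels and 3 classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * score:.1f}")
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5, which would be reported as 50.0 on the percentage scale used in the table.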

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)