Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective

Qishuai Wen, Chun-Guang Li

2024-11-05 | Segmentation, Semantic Segmentation

Abstract

State-of-the-art methods for Transformer-based semantic segmentation typically adopt Transformer decoders that extract additional embeddings from image embeddings via cross-attention, refine either or both types of embeddings via self-attention, and project image embeddings onto the additional embeddings via dot-product. Despite their remarkable success, these empirical designs still lack theoretical justification or interpretation, which hinders principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between Transformer decoders and Principal Component Analysis (PCA). From this perspective, we derive a white-box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT), with the following interpretations: 1) the self-attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most of the information; 2) the cross-attention operator seeks a low-rank approximation of the refined image embeddings, which is expected to be a set of orthonormal bases of the principal subspace that correspond to the predefined classes; 3) the dot-product operation yields a compact representation of the image embeddings as segmentation masks. Experiments on the ADE20K dataset show that DEPICT consistently outperforms its black-box counterpart, Segmenter, while being lightweight and more robust.
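The PCA analogy in the abstract can be illustrated with a toy sketch. This is not the DEPICT decoder: it has no learned attention and no supervision, and it only shows plain PCA on a matrix of patch embeddings, where orthonormal bases give a low-rank approximation and a dot-product against those bases gives one score map per basis. The function name `pca_style_masks`, the mean-centering step, and the use of the largest absolute score as a label are all assumptions made for illustration.

```python
import numpy as np


def pca_style_masks(embeddings, num_classes):
    """Toy PCA analogue of "project embeddings onto class bases".

    embeddings:  (N, D) array, one row per image patch (hypothetical input).
    num_classes: number of bases K to keep, with K <= min(N, D).
    Returns (bases, scores, labels).
    """
    # Centering is a standard PCA step; it is an assumption here,
    # not something stated in the abstract.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)

    # Rows of vt are orthonormal principal directions, ordered by
    # decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    bases = vt[:num_classes]                  # (K, D), orthonormal rows

    # Dot-product projection: a compact K-dimensional code per patch.
    scores = centered @ bases.T               # (N, K)

    # Crude stand-in for a mask: assign each patch to the basis it
    # projects onto most strongly (sign of a PCA basis is arbitrary).
    labels = np.abs(scores).argmax(axis=1)    # (N,)
    return bases, scores, labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(64, 16))       # 64 patches, 16-dim embeddings
    bases, scores, labels = pca_style_masks(patches, num_classes=4)
    # scores @ bases is the rank-4 approximation of the centered embeddings.
    print(bases.shape, scores.shape, labels.shape)
```

In the paper's interpretation the bases are produced by cross-attention and tied to the predefined classes through supervision, whereas here they come directly from an SVD and carry no class meaning.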

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	Cityscapes val	mIoU	81	DEPICT-SA (ViT-L multi-scale)
Semantic Segmentation	Cityscapes val	mIoU	78.8	DEPICT-SA (ViT-L single-scale)
Semantic Segmentation	ADE20K val	mIoU	54.3	DEPICT-SA (ViT-L 640x640 multi-scale)
Semantic Segmentation	ADE20K val	mIoU	52.9	DEPICT-SA (ViT-L 640x640 single-scale)
Semantic Segmentation	PASCAL Context	mIoU	58.6	DEPICT-SA (ViT-L multi-scale)
Semantic Segmentation	PASCAL Context	mIoU	57.9	DEPICT-SA (ViT-L single-scale)
