Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


P2T: Pyramid Pooling Transformer for Scene Understanding

Yu-Huan Wu, Yun Liu, Xin Zhan, Ming-Ming Cheng

2021-06-22 · Image Classification · Scene Understanding · Semantic Segmentation · Instance Segmentation · Object Detection · RGB Salient Object Detection · Saliency Detection

Paper · PDF · Code (official)

Abstract

Recently, the vision transformer has achieved great success by pushing the state-of-the-art of various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.
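The core idea described above — using pyramid pooling to shorten the key/value sequence inside self-attention — can be illustrated with a minimal single-head NumPy sketch. This is not the official P2T implementation; the pooling ratios, the `avg_pool2d` helper, and the single-head simplification are illustrative assumptions (the paper's multi-head, learnable version lives in the linked repository).

```python
import numpy as np

def avg_pool2d(x, k):
    # Non-overlapping average pooling with window/stride k.
    # x: (H, W, C) feature map; hypothetical helper, not the paper's code.
    H, W, C = x.shape
    Hp, Wp = H // k, W // k
    x = x[:Hp * k, :Wp * k].reshape(Hp, k, Wp, k, C)
    return x.mean(axis=(1, 3))

def pyramid_pool_tokens(x, ratios=(2, 4, 8)):
    # Pool the feature map at several scales (illustrative ratios),
    # then flatten and concatenate into one short token sequence.
    pooled = [avg_pool2d(x, r).reshape(-1, x.shape[-1]) for r in ratios]
    return np.concatenate(pooled, axis=0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pooling_mhsa(x, ratios=(2, 4, 8)):
    # Single-head sketch of pooling-based self-attention: queries come
    # from all H*W tokens, but keys/values come from the much shorter
    # pyramid-pooled sequence, cutting the attention cost from
    # O((HW)^2) to O(HW * M) with M << HW.
    H, W, C = x.shape
    q = x.reshape(-1, C)                     # (HW, C)
    kv = pyramid_pool_tokens(x, ratios)      # (M, C)
    attn = softmax(q @ kv.T / np.sqrt(C))    # (HW, M)
    return attn @ kv                         # (HW, C)

x = np.random.randn(32, 32, 64)
out = pooling_mhsa(x)
print(out.shape)  # (1024, 64)
```

For a 32x32 map, the pooled key/value sequence has 16*16 + 8*8 + 4*4 = 336 tokens instead of 1024, while the multi-scale pooling preserves context at several granularities — the abstraction the abstract credits pyramid pooling with.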

Results

Task                          Dataset  Metric         Value  Model
RGB Salient Object Detection  DUTS-TE  MAE            0.029  P2T-Small
RGB Salient Object Detection  DUTS-TE  max F-measure  0.912  P2T-Small
RGB Salient Object Detection  DUTS-TE  MAE            0.033  P2T-Tiny
RGB Salient Object Detection  DUTS-TE  max F-measure  0.895  P2T-Tiny

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)