Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


P2T: Pyramid Pooling Transformer for Scene Understanding

Yu-Huan Wu, Yun Liu, Xin Zhan, Ming-Ming Cheng

2021-06-22 · Image Classification · Scene Understanding · Semantic Segmentation · Instance Segmentation · Object Detection · RGB Salient Object Detection · Saliency Detection

Paper · PDF · Code (official)

Abstract

Recently, the vision transformer has achieved great success by pushing the state-of-the-art of various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Equipped with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when P2T is applied as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.
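The core idea described above — using pyramid pooling to shorten the key/value sequence inside self-attention — can be illustrated with a minimal single-head NumPy sketch. This is not the official P2T implementation; the pooling ratios, the `avg_pool2d` helper, and the single-head simplification are illustrative assumptions (the paper's multi-head, learnable version lives in the linked repository).

```python
import numpy as np

def avg_pool2d(x, k):
    # Non-overlapping average pooling with window/stride k.
    # x: (H, W, C) feature map; hypothetical helper, not the paper's code.
    H, W, C = x.shape
    Hp, Wp = H // k, W // k
    x = x[:Hp * k, :Wp * k].reshape(Hp, k, Wp, k, C)
    return x.mean(axis=(1, 3))

def pyramid_pool_tokens(x, ratios=(2, 4, 8)):
    # Pool the feature map at several scales (illustrative ratios),
    # then flatten and concatenate into one short token sequence.
    pooled = [avg_pool2d(x, r).reshape(-1, x.shape[-1]) for r in ratios]
    return np.concatenate(pooled, axis=0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pooling_mhsa(x, ratios=(2, 4, 8)):
    # Single-head sketch of pooling-based self-attention: queries come
    # from all H*W tokens, but keys/values come from the much shorter
    # pyramid-pooled sequence, cutting the attention cost from
    # O((HW)^2) to O(HW * M) with M << HW.
    H, W, C = x.shape
    q = x.reshape(-1, C)                     # (HW, C)
    kv = pyramid_pool_tokens(x, ratios)      # (M, C)
    attn = softmax(q @ kv.T / np.sqrt(C))    # (HW, M)
    return attn @ kv                         # (HW, C)

x = np.random.randn(32, 32, 64)
out = pooling_mhsa(x)
print(out.shape)  # (1024, 64)
```

For a 32x32 map, the pooled key/value sequence has 16*16 + 8*8 + 4*4 = 336 tokens instead of 1024, while the multi-scale pooling preserves context at several granularities — the abstraction the abstract credits pyramid pooling with.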

Results

Task                          Dataset  Metric         Value  Model
RGB Salient Object Detection  DUTS-TE  MAE            0.029  P2T-Small
RGB Salient Object Detection  DUTS-TE  max F-measure  0.912  P2T-Small
RGB Salient Object Detection  DUTS-TE  MAE            0.033  P2T-Tiny
RGB Salient Object Detection  DUTS-TE  max F-measure  0.895  P2T-Tiny

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)