Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MetaFormer Is Actually What You Need for Vision

Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan

2021-11-22 · CVPR 2022 · Image Classification · Semantic Segmentation · Recommendation Systems · Object Detection

Paper · PDF · Code (official)

Abstract

Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show the attention-based module in Transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the Transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in Transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model, termed PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing well-tuned Vision Transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 50%/62% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and urges us to initiate the concept of "MetaFormer", a general architecture abstracted from Transformers without specifying the token mixer. Based on extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent Transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design. Code is available at https://github.com/sail-sg/poolformer.
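The MetaFormer abstraction in the abstract — keep the Transformer block structure (norm, token mixer, residual; norm, channel MLP, residual) but leave the token mixer unspecified — can be sketched in a few lines. Below is a minimal NumPy toy version with pooling as the token mixer, loosely following the paper's idea; the function names, shapes (HWC instead of the usual NCHW), ReLU in place of GELU, and the absence of learned norm scales are simplifications of my own, not the official implementation. Note the pooling mixer subtracts its input, since the block's residual connection adds it back.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the channel dimension (last axis); no learned
    # scale/shift here, unlike the real model.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pool_mixer(x, pool_size=3):
    """Pooling token mixer: stride-1 average pooling over a
    pool_size x pool_size neighborhood (edge windows average only
    the valid pixels), minus the input itself.
    x: feature map of shape (H, W, C)."""
    H, W, _ = x.shape
    r = pool_size // 2
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            win = x[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            out[i, j] = win.mean(axis=(0, 1))
    return out - x  # residual connection adds x back

def metaformer_block(x, w1, w2, pool_size=3):
    """One MetaFormer block with pooling as the token mixer:
    x = x + TokenMixer(Norm(x));  x = x + MLP(Norm(x))."""
    x = x + pool_mixer(layer_norm(x), pool_size)
    h = np.maximum(layer_norm(x) @ w1, 0.0)  # paper uses GELU; ReLU for brevity
    return x + h @ w2
```

Swapping `pool_mixer` for self-attention recovers a plain Transformer block; swapping in a spatial MLP recovers ResMLP-style models — which is exactly the sense in which the token mixer is a pluggable component.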

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | Validation mIoU | 42.7 | PoolFormer-M48 |
| Object Detection | COCO minival | AP50 | 63.1 | PoolFormer-S36 (Mask R-CNN) |
| Object Detection | COCO minival | AP75 | 44.8 | PoolFormer-S36 (Mask R-CNN) |
| Object Detection | COCO minival | box AP | 41 | PoolFormer-S36 (Mask R-CNN) |
| Image Classification | ImageNet | GFLOPs | 23.2 | MetaFormer PoolFormer-M48 |

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- IP2: Entity-Guided Interest Probing for Personalized News Recommendation (2025-07-18)
- A Reproducibility Study of Product-side Fairness in Bundle Recommendation (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)