Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MaxViT: Multi-Axis Vision Transformer

Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

2022-04-04 | Image Classification, Object Detection

Abstract

Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
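
The abstract's two attention patterns, and its linear-complexity claim, can be illustrated with a small sketch. This is not the authors' implementation: the function names `block_groups`, `grid_groups`, and `attention_pairs` are hypothetical, the code works on token coordinates rather than feature tensors, and it assumes the feature-map height and width are divisible by the block and grid sizes. Each returned group is a set of tokens that would attend only to one another.

```python
def block_groups(height, width, block):
    """Blocked local attention: tokens in the same block x block window share a group."""
    groups = {}
    for r in range(height):
        for c in range(width):
            groups.setdefault((r // block, c // block), []).append((r, c))
    return list(groups.values())


def grid_groups(height, width, grid):
    """Dilated global attention: a grid x grid set of tokens, spread evenly
    over the whole map, shares a group (assumed reading of the abstract)."""
    stride_r, stride_c = height // grid, width // grid
    groups = {}
    for r in range(height):
        for c in range(width):
            # Tokens with the same offset inside each stride cell attend together.
            groups.setdefault((r % stride_r, c % stride_c), []).append((r, c))
    return list(groups.values())


def attention_pairs(groups):
    """Count query-key pairs when attention is restricted to each group."""
    return sum(len(g) ** 2 for g in groups)


if __name__ == "__main__":
    for side in (8, 16, 32):
        tokens = side * side
        local = attention_pairs(block_groups(side, side, 4))
        dilated = attention_pairs(grid_groups(side, side, 4))
        full = tokens ** 2  # unrestricted self-attention, for comparison
        print(side, tokens, local, dilated, full)
```

With the block and grid sizes held fixed, every group contains the same number of tokens, so the pair count grows in proportion to the number of tokens, whereas unrestricted self-attention grows with its square.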

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | COCO 2017 | AP | 53.4 | MaxViT-B |
| Object Detection | COCO 2017 | AP50 | 72.9 | MaxViT-B |
| Object Detection | COCO 2017 | AP75 | 58.1 | MaxViT-B |
| Object Detection | COCO 2017 | APM | 45.7 | MaxViT-B |
| Object Detection | COCO 2017 | APM50 | 70.3 | MaxViT-B |
| Object Detection | COCO 2017 | APM75 | 50 | MaxViT-B |
| Object Detection | COCO 2017 | AP | 53.1 | MaxViT-S |
| Object Detection | COCO 2017 | AP50 | 72.5 | MaxViT-S |
| Object Detection | COCO 2017 | AP75 | 58.1 | MaxViT-S |
| Object Detection | COCO 2017 | APM | 45.4 | MaxViT-S |
| Object Detection | COCO 2017 | APM50 | 69.8 | MaxViT-S |
| Object Detection | COCO 2017 | APM75 | 49.5 | MaxViT-S |
| Object Detection | COCO 2017 | AP | 52.1 | MaxViT-T |
| Object Detection | COCO 2017 | AP50 | 71.9 | MaxViT-T |
| Object Detection | COCO 2017 | AP75 | 56.8 | MaxViT-T |
| Object Detection | COCO 2017 | APM | 44.6 | MaxViT-T |
| Object Detection | COCO 2017 | APM50 | 69.1 | MaxViT-T |
| Object Detection | COCO 2017 | APM75 | 48.4 | MaxViT-T |
| Image Classification | ImageNet | GFLOPs | 43.9 | MaxViT-L (224res) |
| Image Classification | ImageNet | GFLOPs | 23.4 | MaxViT-B (224res) |
| Image Classification | ImageNet | GFLOPs | 11.7 | MaxViT-S (224res) |
| Image Classification | ImageNet | GFLOPs | 5.6 | MaxViT-T (224res) |

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)