MaxViT: Multi-Axis Vision Transformer

Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

2022-04-04. Tasks: Image Classification, Object Detection


Abstract

Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
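The two attention patterns named in the abstract can be illustrated with a small partitioning sketch. This is not the official implementation: the function names, the channel-last NumPy layout, and the window size `p` and grid size `g` parameters are assumptions chosen for illustration. The sketch covers only how tokens might be grouped for blocked local attention (contiguous windows) and dilated global attention (a strided grid). It omits the attention computation itself and the convolutional part of the block.

```python
import numpy as np

def block_partition(x, p):
    """Group an (H, W, C) feature map into non-overlapping p x p windows.

    Returns an array of shape (num_windows, p * p, C). Attending within
    each group mixes only spatially adjacent tokens (blocked local).
    """
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p, c)

def grid_partition(x, g):
    """Group an (H, W, C) feature map into g x g strided grids.

    Returns an array of shape (num_groups, g * g, C). Tokens in a group
    are H/g rows and W/g columns apart, so each group spans the whole
    map sparsely (dilated global).
    """
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c)
    x = x.transpose(1, 3, 0, 2, 4)          # (H/g, W/g, g, g, C)
    return x.reshape(-1, g * g, c)

def pairwise_cost(groups):
    """Number of token pairs scored when attending within each group."""
    num_groups, tokens_per_group, _ = groups.shape
    return num_groups * tokens_per_group ** 2

if __name__ == "__main__":
    # Toy 4 x 4 map with one channel; each value equals its flat index.
    x = np.arange(16).reshape(4, 4, 1)
    print(block_partition(x, 2)[0, :, 0])   # first local window
    print(grid_partition(x, 2)[0, :, 0])    # first dilated group
    print(pairwise_cost(block_partition(x, 2)), 16 ** 2)
```

With a fixed `p` or `g`, the number of scored pairs grows in proportion to H x W rather than (H x W)^2, which is one way to read the abstract's linear-complexity claim. On the toy 4 x 4 map the grouped cost is 64 pairs, against 256 for full attention.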

Results

Task	Dataset	Metric	Value	Model
Object Detection	COCO 2017	AP	53.4	MaxViT-B
Object Detection	COCO 2017	AP50	72.9	MaxViT-B
Object Detection	COCO 2017	AP75	58.1	MaxViT-B
Object Detection	COCO 2017	APM	45.7	MaxViT-B
Object Detection	COCO 2017	APM50	70.3	MaxViT-B
Object Detection	COCO 2017	APM75	50.0	MaxViT-B
Object Detection	COCO 2017	AP	53.1	MaxViT-S
Object Detection	COCO 2017	AP50	72.5	MaxViT-S
Object Detection	COCO 2017	AP75	58.1	MaxViT-S
Object Detection	COCO 2017	APM	45.4	MaxViT-S
Object Detection	COCO 2017	APM50	69.8	MaxViT-S
Object Detection	COCO 2017	APM75	49.5	MaxViT-S
Object Detection	COCO 2017	AP	52.1	MaxViT-T
Object Detection	COCO 2017	AP50	71.9	MaxViT-T
Object Detection	COCO 2017	AP75	56.8	MaxViT-T
Object Detection	COCO 2017	APM	44.6	MaxViT-T
Object Detection	COCO 2017	APM50	69.1	MaxViT-T
Object Detection	COCO 2017	APM75	48.4	MaxViT-T
Image Classification	ImageNet	GFLOPs	43.9	MaxViT-L (224res)
Image Classification	ImageNet	GFLOPs	23.4	MaxViT-B (224res)
Image Classification	ImageNet	GFLOPs	11.7	MaxViT-S (224res)
Image Classification	ImageNet	GFLOPs	5.6	MaxViT-T (224res)
