Ali Hassani, Steven Walton, Jiachen Li, Shen Li, Humphrey Shi
We present Neighborhood Attention (NA), the first efficient and scalable sliding-window attention mechanism for vision. NA is a pixel-wise operation, localizing self attention (SA) to the nearest neighboring pixels, and therefore enjoys a linear time and space complexity compared to the quadratic complexity of SA. The sliding-window pattern allows NA's receptive field to grow without needing extra pixel shifts, and preserves translational equivariance, unlike Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less memory. We further present Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA that boosts image classification and downstream vision performance. Experimental results on NAT are competitive; NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO and 48.4% mIoU on ADE20K, which is 1.9% ImageNet accuracy, 1.0% COCO mAP, and 2.6% ADE20K mIoU improvement over a Swin model with similar size. To support more research based on sliding-window attention, we open source our project and release our checkpoints at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer .
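The core idea of NA — each pixel attends only to its nearest neighbors, with the window clamped at the borders so every query sees the same number of keys — can be illustrated with a minimal NumPy sketch. This is not the NATTEN implementation (which uses fused C++/CUDA kernels and learned QKV projections); it is a naive 1D illustration that uses the input features directly as queries, keys, and values.

```python
import numpy as np

def neighborhood_attention_1d(x, k=3):
    """Naive 1D neighborhood attention sketch (illustration only).

    x: (L, d) array of token features, used directly as Q/K/V.
    k: neighborhood size; each token attends to its k nearest
       neighbors, with the window clamped at the borders as in NA.
    """
    L, d = x.shape
    r = k // 2
    out = np.zeros_like(x)
    for i in range(L):
        # Clamp the window so every token attends to exactly k keys,
        # even at the sequence boundaries.
        start = min(max(i - r, 0), L - k)
        nbrs = x[start:start + k]               # (k, d) neighboring keys/values
        scores = nbrs @ x[i] / np.sqrt(d)       # (k,) scaled dot products
        w = np.exp(scores - scores.max())       # numerically stable softmax
        w /= w.sum()
        out[i] = w @ nbrs                       # weighted sum of neighbors
    return out
```

Because the window is clamped rather than shifted, NA reduces to full self-attention when `k` equals the sequence length, and its cost scales linearly in `L` for fixed `k` instead of quadratically.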
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 1137 | NAT-Base |
| Semantic Segmentation | ADE20K | Params (M) | 123 | NAT-Base |
| Semantic Segmentation | ADE20K | Validation mIoU | 49.7 | NAT-Base |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 1010 | NAT-Small |
| Semantic Segmentation | ADE20K | Params (M) | 82 | NAT-Small |
| Semantic Segmentation | ADE20K | Validation mIoU | 49.5 | NAT-Small |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 934 | NAT-Tiny |
| Semantic Segmentation | ADE20K | Params (M) | 58 | NAT-Tiny |
| Semantic Segmentation | ADE20K | Validation mIoU | 48.4 | NAT-Tiny |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 900 | NAT-Mini |
| Semantic Segmentation | ADE20K | Params (M) | 50 | NAT-Mini |
| Semantic Segmentation | ADE20K | Validation mIoU | 46.4 | NAT-Mini |
| Image Classification | ImageNet | GFLOPs | 13.7 | NAT-Base |
| Image Classification | ImageNet | GFLOPs | 7.8 | NAT-Small |
| Image Classification | ImageNet | GFLOPs | 4.3 | NAT-Tiny |
| Image Classification | ImageNet | GFLOPs | 2.7 | NAT-Mini |