Dilated Neighborhood Attention Transformer

Ali Hassani, Humphrey Shi

2022-09-29

Panoptic Segmentation, Image Classification, Segmentation, Semantic Segmentation, Instance Segmentation, Object Detection

Abstract

Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.6% box AP in COCO object detection, 1.4% mask AP in COCO instance segmentation, and 1.4% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.5 PQ) and ADE20K (49.4 PQ), and instance segmentation model on Cityscapes (45.1 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.1 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).
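The abstract's claim that dilation expands the receptive field exponentially at no additional cost can be illustrated with a small 1D sketch. This is not the authors' implementation: the function names `dilated_neighbors_1d` and `receptive_field`, the border handling (clamping the window so that every token still attends to exactly `kernel_size` tokens), and the choice of dilations `1, 3, 9` are assumptions made for illustration. Each token attends to `kernel_size` tokens per layer regardless of dilation, so the per-layer cost in this sketch is the same in both settings.

```python
import numpy as np

def dilated_neighbors_1d(i, n, kernel_size, dilation=1):
    """Indices that token i attends to in a 1D sequence of length n.

    Assumed behaviour (illustrative): tokens with the same index modulo
    `dilation` form a sub-sequence, and token i attends to the
    `kernel_size` nearest tokens of its own sub-sequence, with the window
    clamped at the borders.
    """
    if n < kernel_size * dilation:
        raise ValueError("sequence too short for this kernel size and dilation")
    group = i % dilation                         # which dilated sub-sequence i belongs to
    pos = i // dilation                          # position of i inside that sub-sequence
    group_len = len(range(group, n, dilation))   # number of tokens in the sub-sequence
    # Centre the window on i, then clamp so it stays inside the sub-sequence.
    start = min(max(pos - kernel_size // 2, 0), group_len - kernel_size)
    return group + dilation * (start + np.arange(kernel_size))

def receptive_field(i, n, kernel_size, dilations):
    """Tokens reachable from token i after stacking one layer per dilation."""
    reached = {i}
    for d in dilations:
        reached = {int(j) for t in reached
                   for j in dilated_neighbors_1d(t, n, kernel_size, d)}
    return sorted(reached)

# Three layers, kernel size 3, sequence of 27 tokens, query at the centre.
local_only = receptive_field(13, 27, 3, dilations=[1, 1, 1])
with_dilation = receptive_field(13, 27, 3, dilations=[1, 3, 9])
print(len(local_only))     # 7  (grows linearly: 1 + 3 * (3 - 1))
print(len(with_dilation))  # 27 (grows exponentially: 3 ** 3)
```

With undilated layers the receptive field grows by `kernel_size - 1` tokens per layer, whereas increasing the dilation by a factor of `kernel_size` per layer multiplies it by `kernel_size`, which is the linear-versus-exponential contrast the abstract describes.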

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	Cityscapes val	mIoU	84.5	DiNAT-L (Mask2Former)
Semantic Segmentation	ADE20K val	mIoU	58.1	DiNAT-L (Mask2Former)
Semantic Segmentation	ADE20K	Validation mIoU	58.1	DiNAT-L (Mask2Former)
Semantic Segmentation	ADE20K	Validation mIoU	54.9	DiNAT-Large (UperNet)
Semantic Segmentation	ADE20K	Validation mIoU	54.6	DiNAT_s-Large (UperNet)
Semantic Segmentation	ADE20K	Validation mIoU	50.4	DiNAT-Base (UperNet)
Semantic Segmentation	ADE20K	Validation mIoU	49.9	DiNAT-Small (UperNet)
Semantic Segmentation	ADE20K	Validation mIoU	48.8	DiNAT-Tiny (UperNet)
Semantic Segmentation	ADE20K	Validation mIoU	47.2	DiNAT-Mini (UperNet)
Semantic Segmentation	Cityscapes val	AP	44.5	DiNAT-L (Mask2Former)
Semantic Segmentation	Cityscapes val	PQ	67.2	DiNAT-L (Mask2Former)
Semantic Segmentation	Cityscapes val	mIoU	83.4	DiNAT-L (Mask2Former)
Semantic Segmentation	ADE20K val	AP	35	DiNAT-L (Mask2Former, 640x640)
Semantic Segmentation	ADE20K val	PQ	49.4	DiNAT-L (Mask2Former, 640x640)
Semantic Segmentation	ADE20K val	mIoU	56.3	DiNAT-L (Mask2Former, 640x640)
Semantic Segmentation	COCO minival	AP	49.2	DiNAT-L (single-scale, Mask2Former)
Semantic Segmentation	COCO minival	PQ	58.5	DiNAT-L (single-scale, Mask2Former)
Semantic Segmentation	COCO minival	PQst	48.8	DiNAT-L (single-scale, Mask2Former)
Semantic Segmentation	COCO minival	PQth	64.9	DiNAT-L (single-scale, Mask2Former)
Semantic Segmentation	COCO minival	mIoU	68.3	DiNAT-L (single-scale, Mask2Former)
Image Classification	ImageNet	GFLOPs	92.4	DiNAT-Large (11x11ks; 384res; Pretrained on IN22K@224)
Image Classification	ImageNet	GFLOPs	89.7	DiNAT-Large (384x384; Pretrained on ImageNet-22K @ 224x224)
Image Classification	ImageNet	GFLOPs	101.5	DiNAT_s-Large (384res; Pretrained on IN22K@224)
Image Classification	ImageNet	GFLOPs	34.5	DiNAT_s-Large (224x224; Pretrained on ImageNet-22K @ 224x224)
Image Classification	ImageNet	GFLOPs	13.7	DiNAT-Base
Image Classification	ImageNet	GFLOPs	7.8	DiNAT-Small
Image Classification	ImageNet	GFLOPs	4.3	DiNAT-Tiny
Image Classification	ImageNet	GFLOPs	2.7	DiNAT-Mini
Instance Segmentation	COCO minival	AP50	75	DiNAT-L (single-scale, Mask2Former)
Instance Segmentation	COCO minival	mask AP	50.8	DiNAT-L (single-scale, Mask2Former)
Instance Segmentation	Cityscapes val	AP50	72.6	DiNAT-L (single-scale, Mask2Former)
Instance Segmentation	Cityscapes val	mask AP	45.1	DiNAT-L (single-scale, Mask2Former)
Instance Segmentation	ADE20K val	AP	35.4	DiNAT-L (Mask2Former, single-scale)
Instance Segmentation	ADE20K val	APL	55.5	DiNAT-L (Mask2Former, single-scale)
Instance Segmentation	ADE20K val	APM	39	DiNAT-L (Mask2Former, single-scale)
Instance Segmentation	ADE20K val	APS	16.3	DiNAT-L (Mask2Former, single-scale)
Panoptic Segmentation	Cityscapes val	AP	44.5	DiNAT-L (Mask2Former)
Panoptic Segmentation	Cityscapes val	PQ	67.2	DiNAT-L (Mask2Former)
Panoptic Segmentation	Cityscapes val	mIoU	83.4	DiNAT-L (Mask2Former)
Panoptic Segmentation	ADE20K val	AP	35	DiNAT-L (Mask2Former, 640x640)
Panoptic Segmentation	ADE20K val	PQ	49.4	DiNAT-L (Mask2Former, 640x640)
Panoptic Segmentation	ADE20K val	mIoU	56.3	DiNAT-L (Mask2Former, 640x640)
Panoptic Segmentation	COCO minival	AP	49.2	DiNAT-L (single-scale, Mask2Former)
Panoptic Segmentation	COCO minival	PQ	58.5	DiNAT-L (single-scale, Mask2Former)
Panoptic Segmentation	COCO minival	PQst	48.8	DiNAT-L (single-scale, Mask2Former)
Panoptic Segmentation	COCO minival	PQth	64.9	DiNAT-L (single-scale, Mask2Former)
Panoptic Segmentation	COCO minival	mIoU	68.3	DiNAT-L (single-scale, Mask2Former)
