Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Dilated Neighborhood Attention Transformer

Ali Hassani, Humphrey Shi

Published: 2022-09-29
Tasks: Panoptic Segmentation · Image Classification · Segmentation · Semantic Segmentation · Instance Segmentation · Object Detection
Resources: Paper · PDF · Code (official and community implementations)

Abstract

Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.6% box AP in COCO object detection, 1.4% mask AP in COCO instance segmentation, and 1.4% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.5 PQ) and ADE20K (49.4 PQ), and instance segmentation model on Cityscapes (45.1 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.1 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).
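The core idea in the abstract — each query attends to a fixed number of neighbors sampled at a dilation rate, so the attention span widens without extra cost — can be sketched in a few lines. This is a hypothetical toy: a single-head, 1-D version that reuses the input as query, key, and value and clamps indices at sequence edges; the paper's actual 2-D implementation (the NATTEN library) uses learned projections and different edge handling.

```python
import numpy as np

def dilated_neighborhood_attention(x, k=3, dilation=1):
    """Toy 1-D dilated neighborhood attention (single head).

    Each query attends to its k neighbors sampled at the given
    dilation, so the window spans dilation * (k - 1) + 1 tokens
    while the cost stays O(n * k). dilation=1 recovers plain
    Neighborhood Attention (NA). For brevity, x serves as query,
    key, and value alike (no learned projections).
    """
    n, d_model = x.shape
    half = k // 2
    out = np.zeros_like(x)
    for i in range(n):
        # Neighbor indices at the chosen dilation, clamped to the sequence.
        idx = np.clip(i + dilation * np.arange(-half, half + 1), 0, n - 1)
        keys = x[idx]                                  # (k, d_model)
        scores = keys @ x[i] / np.sqrt(d_model)        # scaled dot products
        w = np.exp(scores - scores.max())              # stable softmax
        w /= w.sum()
        out[i] = w @ keys                              # weighted sum of neighbors
    return out
```

Stacking layers that alternate dilation rates (as DiNAT does between NA and DiNA blocks) grows the receptive field exponentially with depth, while each layer's cost stays linear in sequence length.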

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | Cityscapes val | mIoU | 84.5 | DiNAT-L (Mask2Former) |
| Semantic Segmentation | ADE20K val | mIoU | 58.1 | DiNAT-L (Mask2Former) |
| Semantic Segmentation | ADE20K | Validation mIoU | 54.9 | DiNAT-Large (UperNet) |
| Semantic Segmentation | ADE20K | Validation mIoU | 54.6 | DiNAT_s-Large (UperNet) |
| Semantic Segmentation | ADE20K | Validation mIoU | 50.4 | DiNAT-Base (UperNet) |
| Semantic Segmentation | ADE20K | Validation mIoU | 49.9 | DiNAT-Small (UperNet) |
| Semantic Segmentation | ADE20K | Validation mIoU | 48.8 | DiNAT-Tiny (UperNet) |
| Semantic Segmentation | ADE20K | Validation mIoU | 47.2 | DiNAT-Mini (UperNet) |
| Image Classification | ImageNet | GFLOPs | 92.4 | DiNAT-Large (11x11 kernel, 384 res; pretrained on ImageNet-22K @ 224) |
| Image Classification | ImageNet | GFLOPs | 89.7 | DiNAT-Large (384x384; pretrained on ImageNet-22K @ 224x224) |
| Image Classification | ImageNet | GFLOPs | 101.5 | DiNAT_s-Large (384 res; pretrained on ImageNet-22K @ 224) |
| Image Classification | ImageNet | GFLOPs | 34.5 | DiNAT_s-Large (224x224; pretrained on ImageNet-22K @ 224x224) |
| Image Classification | ImageNet | GFLOPs | 13.7 | DiNAT-Base |
| Image Classification | ImageNet | GFLOPs | 7.8 | DiNAT-Small |
| Image Classification | ImageNet | GFLOPs | 4.3 | DiNAT-Tiny |
| Image Classification | ImageNet | GFLOPs | 2.7 | DiNAT-Mini |
| Instance Segmentation | COCO minival | AP50 | 75 | DiNAT-L (single-scale, Mask2Former) |
| Instance Segmentation | COCO minival | mask AP | 50.8 | DiNAT-L (single-scale, Mask2Former) |
| Instance Segmentation | Cityscapes val | AP50 | 72.6 | DiNAT-L (single-scale, Mask2Former) |
| Instance Segmentation | Cityscapes val | mask AP | 45.1 | DiNAT-L (single-scale, Mask2Former) |
| Instance Segmentation | ADE20K val | AP | 35.4 | DiNAT-L (Mask2Former, single-scale) |
| Instance Segmentation | ADE20K val | APL | 55.5 | DiNAT-L (Mask2Former, single-scale) |
| Instance Segmentation | ADE20K val | APM | 39 | DiNAT-L (Mask2Former, single-scale) |
| Instance Segmentation | ADE20K val | APS | 16.3 | DiNAT-L (Mask2Former, single-scale) |
| Panoptic Segmentation | Cityscapes val | AP | 44.5 | DiNAT-L (Mask2Former) |
| Panoptic Segmentation | Cityscapes val | PQ | 67.2 | DiNAT-L (Mask2Former) |
| Panoptic Segmentation | Cityscapes val | mIoU | 83.4 | DiNAT-L (Mask2Former) |
| Panoptic Segmentation | ADE20K val | AP | 35 | DiNAT-L (Mask2Former, 640x640) |
| Panoptic Segmentation | ADE20K val | PQ | 49.4 | DiNAT-L (Mask2Former, 640x640) |
| Panoptic Segmentation | ADE20K val | mIoU | 56.3 | DiNAT-L (Mask2Former, 640x640) |
| Panoptic Segmentation | COCO minival | AP | 49.2 | DiNAT-L (single-scale, Mask2Former) |
| Panoptic Segmentation | COCO minival | PQ | 58.5 | DiNAT-L (single-scale, Mask2Former) |
| Panoptic Segmentation | COCO minival | PQst | 48.8 | DiNAT-L (single-scale, Mask2Former) |
| Panoptic Segmentation | COCO minival | PQth | 64.9 | DiNAT-L (single-scale, Mask2Former) |
| Panoptic Segmentation | COCO minival | mIoU | 68.3 | DiNAT-L (single-scale, Mask2Former) |

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)