Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang

2023-09-04 · Image Classification · Semantic Segmentation · Instance Segmentation · Object Detection

Paper · PDF · Code (official)

Abstract

Transformers have shown superior performance on various vision tasks. Their large receptive field endows Transformer models with higher representation power than their CNN counterparts. Nevertheless, simply enlarging the receptive field also raises several concerns. On the one hand, using dense attention in ViT leads to excessive memory and computational cost, and features can be influenced by irrelevant parts that are beyond the region of interest. On the other hand, the handcrafted attention adopted in PVT or Swin Transformer is data agnostic and may limit the ability to model long-range relations. To solve this dilemma, we propose a novel deformable multi-head attention module, where the positions of key and value pairs in self-attention are adaptively allocated in a data-dependent way. This flexible scheme enables the proposed deformable attention to dynamically focus on relevant regions while maintaining the representation power of global attention. On this basis, we present Deformable Attention Transformer (DAT), a general vision backbone that is efficient and effective for visual recognition. We further build an enhanced version, DAT++. Extensive experiments show that our DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
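The core idea in the abstract — a uniform grid of reference points shifted by data-dependent offsets, with keys and values bilinearly sampled at the shifted locations — can be sketched as below. This is a minimal single-head NumPy illustration under stated assumptions, not the authors' implementation; the toy offset network and all parameter names are hypothetical stand-ins for the paper's offset sub-network.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bilinear_sample(feat, pts):
    """Sample a feature map feat (H, W, C) at continuous points pts (N, 2) in [0,1]^2."""
    H, W, C = feat.shape
    x = pts[:, 0] * (W - 1)
    y = pts[:, 1] * (H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = x - x0, y - y0
    return (feat[y0, x0] * ((1 - wx) * (1 - wy))[:, None]
          + feat[y0, x1] * (wx * (1 - wy))[:, None]
          + feat[y1, x0] * ((1 - wx) * wy)[:, None]
          + feat[y1, x1] * (wx * wy)[:, None])

def deformable_attention(x, Wq, Wk, Wv, offset_net, n_ref=4):
    """Single-head deformable attention over a feature map x of shape (H, W, C)."""
    H, W, C = x.shape
    q = x.reshape(-1, C) @ Wq                       # queries from every position
    # Uniform grid of reference points in [0,1]^2.
    gy, gx = np.meshgrid(np.linspace(0, 1, n_ref),
                         np.linspace(0, 1, n_ref), indexing='ij')
    ref = np.stack([gx.ravel(), gy.ravel()], axis=-1)   # (n_ref^2, 2)
    # Data-dependent offsets shift the reference points (the key step).
    pts = np.clip(ref + offset_net(x, ref), 0.0, 1.0)
    sampled = bilinear_sample(x, pts)               # features at deformed points
    k, v = sampled @ Wk, sampled @ Wv               # deformed keys and values
    attn = softmax(q @ k.T / np.sqrt(C))            # (H*W, n_ref^2)
    return attn @ v                                 # (H*W, C)

# Toy usage with random weights; the offset net here is a single linear layer,
# purely illustrative.
rng = np.random.default_rng(0)
H = W = 8; C = 16
x = rng.standard_normal((H, W, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
W_off = rng.standard_normal((C, 2)) * 0.1

def toy_offset_net(feat, ref):
    g = bilinear_sample(feat, ref)       # features at the reference points
    return 0.1 * np.tanh(g @ W_off)      # small (n, 2) offsets

out = deformable_attention(x, Wq, Wk, Wv, toy_offset_net)
print(out.shape)   # one output vector per query position
```

Because every query attends to the same small set of sampled points, the attention matrix is (H·W) × n_ref², far cheaper than the dense (H·W) × (H·W) of vanilla ViT attention, which is the efficiency argument the abstract makes.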

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | ADE20K | Validation mIoU | 51.5 | DAT-B++
Semantic Segmentation | ADE20K | Validation mIoU | 51.2 | DAT-S++
Semantic Segmentation | ADE20K | Validation mIoU | 50.3 | DAT-T++
Object Detection | COCO 2017 | AP | 50.2 | DAT-S++
Object Detection | COCO 2017 | AP | 49.2 | DAT-T++
Image Classification | ImageNet | GFLOPs | 49.7 | DAT-B++ (384x384)
Image Classification | ImageNet | GFLOPs | 16.6 | DAT-B++ (224x224)
Image Classification | ImageNet | GFLOPs | 9.4 | DAT-S++
Image Classification | ImageNet | GFLOPs | 4.3 | DAT-T++

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)