Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang

2023-09-04 · Image Classification · Semantic Segmentation · Instance Segmentation · Object Detection

Paper · PDF · Code (official)

Abstract

Transformers have shown superior performance on various vision tasks. Their large receptive field endows Transformer models with higher representation power than their CNN counterparts. Nevertheless, simply enlarging the receptive field also raises several concerns. On the one hand, using dense attention in ViT leads to excessive memory and computational cost, and features can be influenced by irrelevant parts that are beyond the region of interest. On the other hand, the handcrafted attention adopted in PVT or Swin Transformer is data agnostic and may limit the ability to model long-range relations. To solve this dilemma, we propose a novel deformable multi-head attention module, where the positions of key and value pairs in self-attention are adaptively allocated in a data-dependent way. This flexible scheme enables the proposed deformable attention to dynamically focus on relevant regions while maintaining the representation power of global attention. On this basis, we present Deformable Attention Transformer (DAT), a general vision backbone that is efficient and effective for visual recognition. We further build an enhanced version, DAT++. Extensive experiments show that our DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
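The core idea in the abstract — a uniform grid of reference points shifted by data-dependent offsets, with keys and values bilinearly sampled at the shifted locations — can be sketched as below. This is a minimal single-head NumPy illustration under stated assumptions, not the authors' implementation; the toy offset network and all parameter names are hypothetical stand-ins for the paper's offset sub-network.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bilinear_sample(feat, pts):
    """Sample a feature map feat (H, W, C) at continuous points pts (N, 2) in [0,1]^2."""
    H, W, C = feat.shape
    x = pts[:, 0] * (W - 1)
    y = pts[:, 1] * (H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = x - x0, y - y0
    return (feat[y0, x0] * ((1 - wx) * (1 - wy))[:, None]
          + feat[y0, x1] * (wx * (1 - wy))[:, None]
          + feat[y1, x0] * ((1 - wx) * wy)[:, None]
          + feat[y1, x1] * (wx * wy)[:, None])

def deformable_attention(x, Wq, Wk, Wv, offset_net, n_ref=4):
    """Single-head deformable attention over a feature map x of shape (H, W, C)."""
    H, W, C = x.shape
    q = x.reshape(-1, C) @ Wq                       # queries from every position
    # Uniform grid of reference points in [0,1]^2.
    gy, gx = np.meshgrid(np.linspace(0, 1, n_ref),
                         np.linspace(0, 1, n_ref), indexing='ij')
    ref = np.stack([gx.ravel(), gy.ravel()], axis=-1)   # (n_ref^2, 2)
    # Data-dependent offsets shift the reference points (the key step).
    pts = np.clip(ref + offset_net(x, ref), 0.0, 1.0)
    sampled = bilinear_sample(x, pts)               # features at deformed points
    k, v = sampled @ Wk, sampled @ Wv               # deformed keys and values
    attn = softmax(q @ k.T / np.sqrt(C))            # (H*W, n_ref^2)
    return attn @ v                                 # (H*W, C)

# Toy usage with random weights; the offset net here is a single linear layer,
# purely illustrative.
rng = np.random.default_rng(0)
H = W = 8; C = 16
x = rng.standard_normal((H, W, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
W_off = rng.standard_normal((C, 2)) * 0.1

def toy_offset_net(feat, ref):
    g = bilinear_sample(feat, ref)       # features at the reference points
    return 0.1 * np.tanh(g @ W_off)      # small (n, 2) offsets

out = deformable_attention(x, Wq, Wk, Wv, toy_offset_net)
print(out.shape)   # one output vector per query position
```

Because every query attends to the same small set of sampled points, the attention matrix is (H·W) × n_ref², far cheaper than the dense (H·W) × (H·W) of vanilla ViT attention, which is the efficiency argument the abstract makes.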

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | ADE20K | Validation mIoU | 51.5 | DAT-B++
Semantic Segmentation | ADE20K | Validation mIoU | 51.2 | DAT-S++
Semantic Segmentation | ADE20K | Validation mIoU | 50.3 | DAT-T++
Object Detection | COCO 2017 | AP | 50.2 | DAT-S++
Object Detection | COCO 2017 | AP | 49.2 | DAT-T++
Image Classification | ImageNet | GFLOPs | 49.7 | DAT-B++ (384x384)
Image Classification | ImageNet | GFLOPs | 16.6 | DAT-B++ (224x224)
Image Classification | ImageNet | GFLOPs | 9.4 | DAT-S++
Image Classification | ImageNet | GFLOPs | 4.3 | DAT-T++

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)