Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DaViT: Dual Attention Vision Transformers

Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan

Published: 2022-04-07
Tasks: Image Classification, Semantic Segmentation, Medical Image Classification, Instance Segmentation, Object Detection

Abstract

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/dingmyu/davit.
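The channel-token idea from the abstract — transposing the usual roles so that channels become the tokens and spatial positions become the feature dimension, with channels split into groups to keep the attention matrices small — can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the official DaViT implementation (which uses learned Q/K/V projections and multi-head structure); the function name and the 1/sqrt(N) scaling here are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_group_attention(x, num_groups):
    """Illustrative channel attention: channels act as tokens,
    spatial positions act as the token feature dimension.

    x: (N, C) array of N flattened spatial positions by C channels.
    Channels are split into `num_groups` groups and attention is
    computed within each group, so the attention matrix is
    (C/g, C/g) regardless of N: cost stays linear in image size.
    """
    n, c = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    gc = c // num_groups
    out = np.empty_like(x)
    for g in range(num_groups):
        t = x[:, g * gc:(g + 1) * gc].T            # (gc, N): channel tokens
        scores = softmax((t @ t.T) / np.sqrt(n))   # (gc, gc) channel-channel weights
        out[:, g * gc:(g + 1) * gc] = (scores @ t).T
    return out

# Usage: 16 spatial positions, 8 channels, 2 channel groups.
feats = np.random.default_rng(0).normal(size=(16, 8))
mixed = channel_group_attention(feats, num_groups=2)
```

Because each channel token aggregates every spatial position, one round of this attention mixes global information — the complementary spatial (window) attention in DaViT then refines local detail.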

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | ADE20K val | mIoU | 46.3 | DaViT-T (UperNet)
Semantic Segmentation | ADE20K val | mIoU | 48.8 | DaViT-S (UperNet)
Semantic Segmentation | ADE20K val | mIoU | 49.4 | DaViT-B (UperNet)
Object Detection | COCO minival | box AP | 49.9 | DaViT-T (Mask R-CNN, 36 epochs)
Instance Segmentation | COCO minival | mask AP | 44.3 | DaViT-T (Mask R-CNN, 36 epochs)
Image Classification | ImageNet | GFLOPs | 4.5 | DaViT-T
Image Classification | ImageNet | GFLOPs | 8.8 | DaViT-S
Image Classification | ImageNet | GFLOPs | 15.5 | DaViT-B
Image Classification | ImageNet | GFLOPs | 46.4 | DaViT-B (ImageNet-22k)
Image Classification | ImageNet | GFLOPs | 103 | DaViT-L (ImageNet-22k)
Image Classification | ImageNet | GFLOPs | 334 | DaViT-H
Image Classification | ImageNet | GFLOPs | 1038 | DaViT-G

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)