Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DaViT: Dual Attention Vision Transformers

Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan

Published: 2022-04-07
Tasks: Image Classification, Semantic Segmentation, Medical Image Classification, Instance Segmentation, Object Detection

Abstract

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/dingmyu/davit.
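The channel-token idea from the abstract — transposing the usual roles so that channels become the tokens and spatial positions become the feature dimension, with channels split into groups to keep the attention matrices small — can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the official DaViT implementation (which uses learned Q/K/V projections and multi-head structure); the function name and the 1/sqrt(N) scaling here are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_group_attention(x, num_groups):
    """Illustrative channel attention: channels act as tokens,
    spatial positions act as the token feature dimension.

    x: (N, C) array of N flattened spatial positions by C channels.
    Channels are split into `num_groups` groups and attention is
    computed within each group, so the attention matrix is
    (C/g, C/g) regardless of N: cost stays linear in image size.
    """
    n, c = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    gc = c // num_groups
    out = np.empty_like(x)
    for g in range(num_groups):
        t = x[:, g * gc:(g + 1) * gc].T            # (gc, N): channel tokens
        scores = softmax((t @ t.T) / np.sqrt(n))   # (gc, gc) channel-channel weights
        out[:, g * gc:(g + 1) * gc] = (scores @ t).T
    return out

# Usage: 16 spatial positions, 8 channels, 2 channel groups.
feats = np.random.default_rng(0).normal(size=(16, 8))
mixed = channel_group_attention(feats, num_groups=2)
```

Because each channel token aggregates every spatial position, one round of this attention mixes global information — the complementary spatial (window) attention in DaViT then refines local detail.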

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | ADE20K val | mIoU | 46.3 | DaViT-T (UperNet)
Semantic Segmentation | ADE20K val | mIoU | 48.8 | DaViT-S (UperNet)
Semantic Segmentation | ADE20K val | mIoU | 49.4 | DaViT-B (UperNet)
Object Detection | COCO minival | box AP | 49.9 | DaViT-T (Mask R-CNN, 36 epochs)
Instance Segmentation | COCO minival | mask AP | 44.3 | DaViT-T (Mask R-CNN, 36 epochs)
Image Classification | ImageNet | GFLOPs | 4.5 | DaViT-T
Image Classification | ImageNet | GFLOPs | 8.8 | DaViT-S
Image Classification | ImageNet | GFLOPs | 15.5 | DaViT-B
Image Classification | ImageNet | GFLOPs | 46.4 | DaViT-B (ImageNet-22k)
Image Classification | ImageNet | GFLOPs | 103 | DaViT-L (ImageNet-22k)
Image Classification | ImageNet | GFLOPs | 334 | DaViT-H
Image Classification | ImageNet | GFLOPs | 1038 | DaViT-G

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)