DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention

Nguyen Huu Bao Long, Chenyu Zhang, Yuzhi Shi, Tsubasa Hirakawa, Takayoshi Yamashita, Tohgoroh Matsui, Hironobu Fujiyoshi

2024-10-11
Image Classification, Semantic Segmentation, Object Detection


Abstract

Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity-adaptive attention, such as in DAT, has yielded strong results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation tasks. The query-aware sparsity attention in BiFormer seeks to focus each query on top-k routed regions. However, during attention calculation, the selected key-value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which optimizes the selection of key-value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness. Code is available at https://github.com/maclong01/DeBiFormer.
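As a rough illustration of the "top-k routed regions" idea that the abstract attributes to BiFormer, the sketch below mean-pools the tokens in each region into one region-level query and key, scores every pair of regions by dot product, and keeps the k best-scoring regions per query region. It is a toy sketch under stated assumptions, not the DBRA module or the authors' code: the function names (`mean_pool`, `topk_region_routing`), the mean-pooling choice, and the toy data are all hypothetical, and deformable agent queries are not modeled.

```python
def mean_pool(vectors):
    """Average a list of equal-length token vectors into one region-level vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def topk_region_routing(region_queries, region_keys, k):
    """For each query region, return the indices of the k key regions
    with the highest query-key affinity (hypothetical helper)."""
    routed = []
    for q in region_queries:
        scores = [dot(q, key) for key in region_keys]
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        routed.append(order[:k])
    return routed


# Toy input (assumed): 3 regions, each holding 2 tokens of dimension 2.
regions_q = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
regions_k = [
    [[2.0, 0.0], [2.0, 0.0]],
    [[0.0, 3.0], [0.0, 3.0]],
    [[1.0, 0.5], [1.0, 0.5]],
]

region_queries = [mean_pool(r) for r in regions_q]
region_keys = [mean_pool(r) for r in regions_k]

# Each query region attends only to the key-value pairs of its k routed regions.
routing = topk_region_routing(region_queries, region_keys, k=2)
print(routing)
```

In a full attention layer, the tokens of each query region would then attend only to the key-value pairs gathered from its routed regions; that gathering step is omitted here.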

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	ADE20K	Validation mIoU	52	DeBiFormer-B (IN1k pretrain, Upernet 160k)
Object Detection	COCO 2017	mAP	48.5	DeBiFormer-B (IN1k pretrain, MaskRCNN 12ep)
Object Detection	COCO 2017	mAP	47.5	DeBiFormer-S (IN1k pretrain, MaskRCNN 12ep)
Object Detection	COCO 2017	mAP	47.1	DeBiFormer-B (IN1k pretrain, Retina)
Object Detection	COCO 2017	mAP	45.6	DeBiFormer-S (IN1k pretrain, Retina)
