Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

BiFormer: Vision Transformer with Bi-Level Routing Attention

Lei Zhu, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, Rynson Lau

2023-03-15 · CVPR 2023 · Image Classification, Semantic Segmentation, Object Detection

Abstract

As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to be inside local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions (i.e., routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which utilizes the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a query-adaptive manner without distraction from other irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks such as image classification, object detection, and semantic segmentation verify the effectiveness of our design. Code is available at https://github.com/rayleizhu/BiFormer.
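The two-stage routing described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal single-head version under stated assumptions: region descriptors are obtained by mean-pooling the queries and keys inside each region, routing keeps the `topk` highest-affinity regions per query region, and the projection matrices `w_q`, `w_k`, `w_v` as well as the function name `bi_level_routing_attention` are hypothetical names chosen for illustration. Multi-head splitting and any additional terms used in the real model are omitted.

```python
import numpy as np

def bi_level_routing_attention(x, w_q, w_k, w_v, num_regions, topk):
    """Single-head sketch. x: (H, W, C); num_regions: regions per side (S);
    topk: number of key/value regions each query region attends to."""
    H, W, C = x.shape
    S = num_regions
    h, w = H // S, W // S  # region size; assumes H and W are divisible by S

    # Partition the feature map into S*S regions of h*w tokens: (S*S, h*w, C).
    tokens = x.reshape(S, h, S, w, C).transpose(0, 2, 1, 3, 4).reshape(S * S, h * w, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v

    # Coarse level: one descriptor per region (mean-pooling is an assumption here),
    # then a region-to-region affinity matrix of shape (S*S, S*S).
    q_r, k_r = q.mean(axis=1), k.mean(axis=1)
    affinity = q_r @ k_r.T

    # Keep the indices of the topk most relevant regions for each query region.
    routed = np.argsort(-affinity, axis=1)[:, :topk]  # (S*S, topk)

    # Gather keys/values of the routed regions: (S*S, topk*h*w, C).
    k_g = k[routed].reshape(S * S, topk * h * w, C)
    v_g = v[routed].reshape(S * S, topk * h * w, C)

    # Fine level: dense token-to-token attention inside the gathered set.
    logits = q @ k_g.transpose(0, 2, 1) / np.sqrt(C)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v_g  # (S*S, h*w, C)

    # Undo the region partition to recover the (H, W, C) layout.
    out = out.reshape(S, S, h, w, C).transpose(0, 2, 1, 3, 4).reshape(H, W, C)
    return out, routed

# Usage with random inputs (illustrative sizes only).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
w_q, w_k, w_v = (rng.standard_normal((4, 4)) for _ in range(3))
out, routed = bi_level_routing_attention(x, w_q, w_k, w_v, num_regions=2, topk=2)
print(out.shape)  # (8, 8, 4)
```

In this sketch, setting `topk` equal to the total number of regions recovers ordinary dense attention over all tokens, which makes the saving explicit: each query attends to `topk * h * w` keys instead of `H * W`, and both the routing step and the attention step remain plain dense matrix multiplications over gathered tensors.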

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Semantic Segmentation | ADE20K | Validation mIoU | 51.7 | BiFormer-B (IN1k pretrain, Upernet 160k) |
| Semantic Segmentation | ADE20K | Validation mIoU | 50.8 | Upernet-BiFormer-S (IN1k pretrain, Upernet 160k) |
| Object Detection | COCO 2017 | mAP | 48.6 | BiFormer-B (IN1k pretrain, MaskRCNN 12ep) |
| Object Detection | COCO 2017 | mAP | 47.8 | BiFormer-S (IN1k pretrain, MaskRCNN 12ep) |
