Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Focal Self-attention for Local-Global Interactions in Vision Transformers

Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

Published: 2021-07-01
Tasks: Image Classification · Semantic Segmentation · Instance Segmentation · Object Detection
Links: Paper · PDF · Code (official)

Abstract

Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability to capture short- and long-range visual dependencies through self-attention is arguably the main source of this success, but it also brings challenges due to quadratic computational overhead, especially for high-resolution vision tasks (e.g., object detection). In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions. With this new mechanism, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity, and can thus capture both short- and long-range visual dependencies efficiently and effectively. With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M and a larger size of 89.8M achieve 83.5 and 83.8 Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art Swin Transformers for 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation, creating new SoTA on three of the most challenging computer vision tasks.
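The local-global attention pattern described in the abstract can be sketched in a few lines. The toy NumPy function below (not the paper's implementation; the function name, window size, and pooling factor are illustrative assumptions) shows the core idea for a single query token: it attends to its immediate neighbourhood at full resolution and to the whole feature map at pooled, coarse resolution, so the key set grows far more slowly than the quadratic all-pairs case.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def focal_attention_single(feat, qy, qx, local=1, pool=2):
    """Toy sketch of focal self-attention for one query token.

    feat: (H, W, C) feature map. The query at (qy, qx) attends to:
      - fine keys:   its (2*local+1)^2 neighbourhood at full resolution
      - coarse keys: the whole map average-pooled by a factor of `pool`
    Window size and pooling factor are illustrative, not the paper's settings.
    """
    H, W, C = feat.shape
    q = feat[qy, qx]                                  # query vector, (C,)

    # fine-grained local keys: clipped square window around the query
    ys = slice(max(0, qy - local), min(H, qy + local + 1))
    xs = slice(max(0, qx - local), min(W, qx + local + 1))
    fine = feat[ys, xs].reshape(-1, C)

    # coarse-grained global keys: block-average-pool the whole map
    Hp, Wp = H // pool, W // pool
    coarse = (feat[:Hp * pool, :Wp * pool]
              .reshape(Hp, pool, Wp, pool, C)
              .mean(axis=(1, 3))
              .reshape(-1, C))

    # standard scaled dot-product attention over the concatenated key set
    keys = np.concatenate([fine, coarse], axis=0)     # values = keys here
    attn = softmax(keys @ q / np.sqrt(C))
    return attn @ keys                                # attended output, (C,)
```

For an H x W map this query sees roughly (2*local+1)^2 + (H/pool)*(W/pool) keys instead of H*W, which is the source of the efficiency gain the abstract claims for high-resolution inputs.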

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | ADE20K val | mIoU | 55.4 | Focal-L (UperNet, ImageNet-22k pretrain)
Object Detection | COCO test-dev | box mAP | 58.9 | Focal-L (DyHead, multi-scale)
Object Detection | COCO minival | AP50 | 77.2 | Focal-L (DyHead, multi-scale)
Object Detection | COCO minival | APL | 73.4 | Focal-L (DyHead, multi-scale)
Object Detection | COCO minival | box AP | 58.7 | Focal-L (DyHead, multi-scale)
Instance Segmentation | COCO minival | mask AP | 50.9 | Focal-L (HTC++, multi-scale)
Instance Segmentation | COCO test-dev | AP50 | 75.4 | Focal-L (HTC++, multi-scale)
Instance Segmentation | COCO test-dev | AP75 | 56.5 | Focal-L (HTC++, multi-scale)
Instance Segmentation | COCO test-dev | APL | 64.2 | Focal-L (HTC++, multi-scale)
Instance Segmentation | COCO test-dev | APS | 35.6 | Focal-L (HTC++, multi-scale)
Instance Segmentation | COCO test-dev | mask AP | 51.3 | Focal-L (HTC++, multi-scale)

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)