Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Focal Self-attention for Local-Global Interactions in Vision Transformers

Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

Published: 2021-07-01
Tasks: Image Classification · Semantic Segmentation · Instance Segmentation · Object Detection
Links: Paper · PDF · Code (official)

Abstract

Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability to capture short- and long-range visual dependencies through self-attention is arguably the main source of this success, but it also brings challenges due to quadratic computational overhead, especially for high-resolution vision tasks (e.g., object detection). In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions. With this new mechanism, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity, and can thus capture both short- and long-range visual dependencies efficiently and effectively. With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M and a larger size of 89.8M achieve 83.5 and 83.8 Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art Swin Transformers for 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation, creating new SoTA on three of the most challenging computer vision tasks.
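The local-global attention pattern described in the abstract can be sketched in a few lines. The toy NumPy function below (not the paper's implementation; the function name, window size, and pooling factor are illustrative assumptions) shows the core idea for a single query token: it attends to its immediate neighbourhood at full resolution and to the whole feature map at pooled, coarse resolution, so the key set grows far more slowly than the quadratic all-pairs case.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def focal_attention_single(feat, qy, qx, local=1, pool=2):
    """Toy sketch of focal self-attention for one query token.

    feat: (H, W, C) feature map. The query at (qy, qx) attends to:
      - fine keys:   its (2*local+1)^2 neighbourhood at full resolution
      - coarse keys: the whole map average-pooled by a factor of `pool`
    Window size and pooling factor are illustrative, not the paper's settings.
    """
    H, W, C = feat.shape
    q = feat[qy, qx]                                  # query vector, (C,)

    # fine-grained local keys: clipped square window around the query
    ys = slice(max(0, qy - local), min(H, qy + local + 1))
    xs = slice(max(0, qx - local), min(W, qx + local + 1))
    fine = feat[ys, xs].reshape(-1, C)

    # coarse-grained global keys: block-average-pool the whole map
    Hp, Wp = H // pool, W // pool
    coarse = (feat[:Hp * pool, :Wp * pool]
              .reshape(Hp, pool, Wp, pool, C)
              .mean(axis=(1, 3))
              .reshape(-1, C))

    # standard scaled dot-product attention over the concatenated key set
    keys = np.concatenate([fine, coarse], axis=0)     # values = keys here
    attn = softmax(keys @ q / np.sqrt(C))
    return attn @ keys                                # attended output, (C,)
```

For an H x W map this query sees roughly (2*local+1)^2 + (H/pool)*(W/pool) keys instead of H*W, which is the source of the efficiency gain the abstract claims for high-resolution inputs.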

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | ADE20K val | mIoU | 55.4 | Focal-L (UperNet, ImageNet-22k pretrain)
Object Detection | COCO test-dev | box mAP | 58.9 | Focal-L (DyHead, multi-scale)
Object Detection | COCO minival | AP50 | 77.2 | Focal-L (DyHead, multi-scale)
Object Detection | COCO minival | APL | 73.4 | Focal-L (DyHead, multi-scale)
Object Detection | COCO minival | box AP | 58.7 | Focal-L (DyHead, multi-scale)
Instance Segmentation | COCO minival | mask AP | 50.9 | Focal-L (HTC++, multi-scale)
Instance Segmentation | COCO test-dev | AP50 | 75.4 | Focal-L (HTC++, multi-scale)
Instance Segmentation | COCO test-dev | AP75 | 56.5 | Focal-L (HTC++, multi-scale)
Instance Segmentation | COCO test-dev | APL | 64.2 | Focal-L (HTC++, multi-scale)
Instance Segmentation | COCO test-dev | APS | 35.6 | Focal-L (HTC++, multi-scale)
Instance Segmentation | COCO test-dev | mask AP | 51.3 | Focal-L (HTC++, multi-scale)

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)