Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Global Context Vision Transformers

Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, Pavlo Molchanov

2022-06-20 · Image Classification · Segmentation · Semantic Segmentation · Instance Segmentation · Object Detection

Abstract

We propose the Global Context Vision Transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision. Our method leverages global context self-attention modules, jointly with standard local self-attention, to effectively and efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K classification dataset, the variants of GC ViT with 51M, 90M, and 201M parameters achieve 84.3%, 85.0%, and 85.7% Top-1 accuracy, respectively, at 224×224 image resolution and without any pre-training, hence surpassing comparably-sized prior art such as the CNN-based ConvNeXt and the ViT-based MaxViT and Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work in the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets. Specifically, GC ViT with a 4-scale DINO detection head achieves a box AP of 58.3 on the MS COCO dataset.
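The core idea in the abstract — pairing local window self-attention with a global-query attention that avoids masks and window shifting — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the shapes, the helper names, and the way the global query tokens `g` are supplied directly (rather than produced by GC ViT's CNN-like global query generator) are all simplifying assumptions of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (n_q, d) x (n_k, d) x (n_k, d) -> (n_q, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def local_window_attention(x, window):
    # Standard local self-attention: each token attends only to
    # tokens inside its own (non-overlapping) window.
    n, d = x.shape
    out = np.empty_like(x)
    for start in range(0, n, window):
        w = x[start:start + window]
        out[start:start + window] = attention(w, w, w)
    return out

def global_context_attention(x, g, window):
    # GC ViT-style global attention: the same global query tokens `g`
    # (summarizing the whole feature map) are reused as queries for
    # every window, while keys/values remain local. No attention masks
    # or shifted windows are needed to mix long-range information.
    # Assumption of this sketch: `g` has `window` rows, one query per
    # output position in each window.
    n, d = x.shape
    out = np.empty_like(x)
    for start in range(0, n, window):
        w = x[start:start + window]
        out[start:start + window] = attention(g, w, w)
    return out
```

In the actual architecture these two attention types alternate within each stage, so local blocks capture short-range detail while global blocks propagate image-wide context at low cost.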

Results

| Task                  | Dataset  | Metric             | Value | Model      |
|-----------------------|----------|--------------------|-------|------------|
| Semantic Segmentation | ADE20K   | GFLOPs (512 x 512) | 1348  | GC ViT-B   |
| Semantic Segmentation | ADE20K   | Params (M)         | 125   | GC ViT-B   |
| Semantic Segmentation | ADE20K   | Validation mIoU    | 49    | GC ViT-B   |
| Semantic Segmentation | ADE20K   | GFLOPs (512 x 512) | 1163  | GC ViT-S   |
| Semantic Segmentation | ADE20K   | Params (M)         | 84    | GC ViT-S   |
| Semantic Segmentation | ADE20K   | Validation mIoU    | 48.3  | GC ViT-S   |
| Semantic Segmentation | ADE20K   | GFLOPs (512 x 512) | 947   | GC ViT-T   |
| Semantic Segmentation | ADE20K   | Params (M)         | 58    | GC ViT-T   |
| Semantic Segmentation | ADE20K   | Validation mIoU    | 46.5  | GC ViT-T   |
| Image Classification  | ImageNet | GFLOPs             | 14.8  | GC ViT-B   |
| Image Classification  | ImageNet | GFLOPs             | 8.5   | GC ViT-S   |
| Image Classification  | ImageNet | GFLOPs             | 4.7   | GC ViT-T   |
| Image Classification  | ImageNet | GFLOPs             | 2.6   | GC ViT-XT  |
| Image Classification  | ImageNet | GFLOPs             | 2.1   | GC ViT-XXT |

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)