Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Global Context Vision Transformers

Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, Pavlo Molchanov

2022-06-20 · Image Classification · Segmentation · Semantic Segmentation · Instance Segmentation · Object Detection

Abstract

We propose the Global Context Vision Transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision. Our method leverages global context self-attention modules, jointly with standard local self-attention, to effectively and efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K classification dataset, the variants of GC ViT with 51M, 90M, and 201M parameters achieve 84.3%, 85.0%, and 85.7% Top-1 accuracy, respectively, at 224×224 image resolution and without any pre-training, hence surpassing comparably-sized prior art such as the CNN-based ConvNeXt and the ViT-based MaxViT and Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work in the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets. Specifically, GC ViT with a 4-scale DINO detection head achieves a box AP of 58.3 on the MS COCO dataset.
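The core idea in the abstract — pairing local window self-attention with a global-query attention that avoids masks and window shifting — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the shapes, the helper names, and the way the global query tokens `g` are supplied directly (rather than produced by GC ViT's CNN-like global query generator) are all simplifying assumptions of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (n_q, d) x (n_k, d) x (n_k, d) -> (n_q, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def local_window_attention(x, window):
    # Standard local self-attention: each token attends only to
    # tokens inside its own (non-overlapping) window.
    n, d = x.shape
    out = np.empty_like(x)
    for start in range(0, n, window):
        w = x[start:start + window]
        out[start:start + window] = attention(w, w, w)
    return out

def global_context_attention(x, g, window):
    # GC ViT-style global attention: the same global query tokens `g`
    # (summarizing the whole feature map) are reused as queries for
    # every window, while keys/values remain local. No attention masks
    # or shifted windows are needed to mix long-range information.
    # Assumption of this sketch: `g` has `window` rows, one query per
    # output position in each window.
    n, d = x.shape
    out = np.empty_like(x)
    for start in range(0, n, window):
        w = x[start:start + window]
        out[start:start + window] = attention(g, w, w)
    return out
```

In the actual architecture these two attention types alternate within each stage, so local blocks capture short-range detail while global blocks propagate image-wide context at low cost.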

Results

| Task                  | Dataset  | Metric             | Value | Model      |
|-----------------------|----------|--------------------|-------|------------|
| Semantic Segmentation | ADE20K   | GFLOPs (512 x 512) | 1348  | GC ViT-B   |
| Semantic Segmentation | ADE20K   | Params (M)         | 125   | GC ViT-B   |
| Semantic Segmentation | ADE20K   | Validation mIoU    | 49    | GC ViT-B   |
| Semantic Segmentation | ADE20K   | GFLOPs (512 x 512) | 1163  | GC ViT-S   |
| Semantic Segmentation | ADE20K   | Params (M)         | 84    | GC ViT-S   |
| Semantic Segmentation | ADE20K   | Validation mIoU    | 48.3  | GC ViT-S   |
| Semantic Segmentation | ADE20K   | GFLOPs (512 x 512) | 947   | GC ViT-T   |
| Semantic Segmentation | ADE20K   | Params (M)         | 58    | GC ViT-T   |
| Semantic Segmentation | ADE20K   | Validation mIoU    | 46.5  | GC ViT-T   |
| Image Classification  | ImageNet | GFLOPs             | 14.8  | GC ViT-B   |
| Image Classification  | ImageNet | GFLOPs             | 8.5   | GC ViT-S   |
| Image Classification  | ImageNet | GFLOPs             | 4.7   | GC ViT-T   |
| Image Classification  | ImageNet | GFLOPs             | 2.6   | GC ViT-XT  |
| Image Classification  | ImageNet | GFLOPs             | 2.1   | GC ViT-XXT |

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)