Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Scaling Vision Transformers to 22 Billion Parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby

Published: 2023-02-10
Tasks: Fairness · Image Classification · Action Classification · Object Recognition · Linear-Probe Classification · Zero-Shot Transfer Image Classification
Links: Paper · PDF · Code

Abstract

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
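The "lightweight linear model on frozen features" mentioned above refers to linear probing: features are extracted once from the frozen backbone and only a linear classifier is trained on top of them. A minimal sketch of that evaluation, assuming a generic `backbone` callable and data loaders as placeholders together with scikit-learn's `LogisticRegression` (these names are illustrative and not the paper's actual pipeline):

```python
# Linear-probe sketch: train a linear classifier on frozen backbone features.
# `backbone`, `train_loader`, and `test_loader` are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(backbone, loader):
    """Run the frozen backbone once and collect (features, labels) arrays."""
    feats, labels = [], []
    for images, targets in loader:
        feats.append(backbone(images))   # backbone weights are never updated
        labels.append(targets)
    return np.concatenate(feats), np.concatenate(labels)

def linear_probe(backbone, train_loader, test_loader):
    """Fit a lightweight linear head on frozen features; return test accuracy."""
    X_train, y_train = extract_features(backbone, train_loader)
    X_test, y_test = extract_features(backbone, test_loader)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
```

Only the linear head is optimized, so the probe measures the quality of the frozen representation rather than the benefit of full fine-tuning.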

Results

Task | Dataset | Metric | Value | Model
Video | Kinetics-400 | Acc@1 | 88 | ViT-22B
Object Recognition | shape bias | shape bias | 86.4 | ViT-22B-384
Object Recognition | shape bias | shape bias | 83.8 | ViT-22B-560
Object Recognition | shape bias | shape bias | 78 | ViT-22B-224
Zero-Shot Transfer Image Classification | ImageNet V2 | Accuracy (Private) | 80.9 | LiT-22B
Zero-Shot Transfer Image Classification | ImageNet-A | Accuracy (Private) | 90.1 | LiT-22B
Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 85.9 | LiT-22B
Zero-Shot Transfer Image Classification | ImageNet-R | Accuracy | 96 | LiT-22B
Zero-Shot Transfer Image Classification | ObjectNet | Accuracy (Private) | 87.6 | LiT-22B
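The shape-bias rows report the fraction of shape-consistent decisions on cue-conflict images (in the style of Geirhos et al.), counted only over trials where the model predicts either the shape class or the texture class. A minimal sketch of that computation, with `predictions`, `shape_labels`, and `texture_labels` as illustrative arrays rather than the paper's evaluation code:

```python
import numpy as np

def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape decisions among trials decided by shape or texture."""
    predictions = np.asarray(predictions)
    shape_hits = predictions == np.asarray(shape_labels)
    texture_hits = predictions == np.asarray(texture_labels)
    decided = shape_hits | texture_hits      # drop trials matching neither cue
    return shape_hits[decided].mean()

# Toy example: 3 of 4 decided trials follow shape, so shape bias = 0.75.
print(shape_bias([0, 1, 2, 3, 9], [0, 1, 2, 7, 8], [5, 6, 7, 3, 4]))
```

Under this definition, a value of 86.4 means that when ViT-22B-384 commits to one of the two conflicting cues, it follows the shape cue in 86.4% of cases.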

Related Papers

A Reproducibility Study of Product-side Fairness in Bundle Recommendation (2025-07-18)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
FedGA: A Fair Federated Learning Framework Based on the Gini Coefficient (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Looking for Fairness in Recommender Systems (2025-07-16)