
PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut

Published: 2022-09-14
Tasks: Question Answering, Image Classification, Zero-Shot Image Classification, Few-Shot Image Classification, Image Captioning, Visual Reasoning, Zero-Shot Transfer Image Classification, Visual Question Answering (VQA)
Links: Paper, PDF, Code (official)

Abstract

Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
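The abstract's key architectural point — reusing a pretrained ViT as the image encoder and feeding its output tokens, together with the embedded text prompt, into a pretrained encoder-decoder language model that generates the answer as text — can be illustrated with a minimal PyTorch sketch. This is a hypothetical stand-in, not the actual PaLI implementation: the class name `PaLIStyleModel`, all dimensions, and the use of `nn.Transformer` modules in place of ViT-e and an mT5-style language model are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PaLIStyleModel(nn.Module):
    """Toy sketch of the PaLI interface: image -> visual tokens -> text out."""

    def __init__(self, d_model=512, n_heads=8, n_layers=4, vocab_size=32000):
        super().__init__()
        # Stand-in for the pretrained ViT (PaLI scales this up to ViT-e, 4B params).
        self.patch_embed = nn.Linear(16 * 16 * 3, d_model)  # 16x16 RGB patches
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Stand-in for the pretrained encoder-decoder LM (mT5-style in PaLI).
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.encoder_decoder = nn.Transformer(
            d_model, n_heads, n_layers, n_layers, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, prompt_ids, target_ids):
        # 1) Encode image patches into visual tokens.
        vis = self.vision_encoder(self.patch_embed(patches))
        # 2) The "flexible task interface": visual tokens are concatenated
        #    with the embedded text prompt and consumed by the encoder.
        src = torch.cat([vis, self.text_embed(prompt_ids)], dim=1)
        # 3) The decoder produces the output as text (logits, teacher-forced here).
        tgt = self.text_embed(target_ids)
        return self.lm_head(self.encoder_decoder(src, tgt))

model = PaLIStyleModel()
patches = torch.randn(2, 196, 16 * 16 * 3)   # 2 images, 14x14 patch grid
prompt = torch.randint(0, 32000, (2, 12))    # e.g. a VQA question, any language
target = torch.randint(0, 32000, (2, 8))     # answer tokens for teacher forcing
logits = model(patches, prompt, target)      # -> (2, 8, 32000)
```

Because every task (captioning, VQA, classification) is cast as text-in/text-out through this single interface, the vision and language towers can be scaled independently without changing the task plumbing — which is exactly the joint-scaling question the paper studies.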

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Visual Question Answering (VQA) | TextVQA test-standard | Overall | 73.1 | PaLI |
| Visual Question Answering (VQA) | VizWiz 2020 VQA | Overall | 73.3 | PaLI |
| Visual Question Answering (VQA) | OK-VQA | Accuracy | 64.5 | PaLI 17B |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 84.3 | PaLI |
| Image Captioning | nocaps near-domain | BLEU-1 | 88.57 | PaLI |
| Image Captioning | nocaps near-domain | BLEU-2 | 75.56 | PaLI |
| Image Captioning | nocaps near-domain | BLEU-3 | 58.99 | PaLI |
| Image Captioning | nocaps near-domain | BLEU-4 | 39.98 | PaLI |
| Image Captioning | nocaps near-domain | CIDEr | 124.35 | PaLI |
| Image Captioning | nocaps near-domain | METEOR | 33.47 | PaLI |
| Image Captioning | nocaps near-domain | ROUGE-L | 63.99 | PaLI |
| Image Captioning | nocaps near-domain | SPICE | 15.75 | PaLI |
| Image Captioning | nocaps out-of-domain | BLEU-1 | 86.28 | PaLI |
| Image Captioning | nocaps out-of-domain | BLEU-2 | 71.19 | PaLI |
| Image Captioning | nocaps out-of-domain | BLEU-3 | 52.63 | PaLI |
| Image Captioning | nocaps out-of-domain | BLEU-4 | 32 | PaLI |
| Image Captioning | nocaps out-of-domain | CIDEr | 126.67 | PaLI |
| Image Captioning | nocaps out-of-domain | METEOR | 30.99 | PaLI |
| Image Captioning | nocaps out-of-domain | ROUGE-L | 61.35 | PaLI |
| Image Captioning | nocaps out-of-domain | SPICE | 15.49 | PaLI |
| Image Captioning | nocaps in-domain | BLEU-1 | 88.02 | PaLI |
| Image Captioning | nocaps in-domain | BLEU-2 | 75.21 | PaLI |
| Image Captioning | nocaps in-domain | BLEU-3 | 59.38 | PaLI |
| Image Captioning | nocaps in-domain | BLEU-4 | 41.16 | PaLI |
| Image Captioning | nocaps in-domain | CIDEr | 121.09 | PaLI |
| Image Captioning | nocaps in-domain | CIDEr | 149.1 | PaLI |
| Image Captioning | nocaps in-domain | METEOR | 34.22 | PaLI |
| Image Captioning | nocaps in-domain | ROUGE-L | 64.39 | PaLI |
| Image Captioning | nocaps in-domain | SPICE | 15.69 | PaLI |
| Image Classification | ImageNet V2 | Top-1 Accuracy | 84.3 | ViT-e |
| Image Classification | ObjectNet | Top-1 Accuracy | 72 | ViT-e |
| Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 85.4 | LiT ViT-e |
| Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 72.11 | PaLI |
| Zero-Shot Transfer Image Classification | ImageNet V2 | Accuracy (Private) | 80.6 | LiT ViT-e |
| Zero-Shot Transfer Image Classification | ImageNet V2 | Accuracy (Private) | 64.46 | PaLI |
| Zero-Shot Transfer Image Classification | ImageNet-A | Accuracy (Private) | 88 | LiT ViT-e |
| Zero-Shot Transfer Image Classification | ImageNet-A | Accuracy (Private) | 44.7 | PaLI |
| Zero-Shot Transfer Image Classification | ImageNet-R | Accuracy | 96.1 | LiT ViT-e |
| Zero-Shot Transfer Image Classification | ImageNet-R | Accuracy | 81.97 | PaLI |
| Zero-Shot Transfer Image Classification | ObjectNet | Accuracy (Private) | 84.9 | LiT ViT-e |
| Zero-Shot Transfer Image Classification | ObjectNet | Accuracy (Private) | 42.62 | PaLI |
| Zero-Shot Transfer Image Classification | ObjectNet | Top-5 Accuracy | 58.35 | PaLI |
| Zero-Shot Transfer Image Classification | ImageNet-S | Accuracy (Private) | 63.83 | PaLI |
| Zero-Shot Transfer Image Classification | ImageNet-S | Top-5 Accuracy | 79.3 | PaLI |
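The captioning rows above use the standard COCO-style metrics (BLEU-n, CIDEr, METEOR, ROUGE-L, SPICE). As a reference point, here is a minimal sketch of how such scores are computed with the pycocoevalcap toolkit, the usual open-source implementation of these metrics; whether PaLI's reported numbers were produced with exactly this tooling is an assumption, and the toy captions below are invented for illustration:

```python
# pip install pycocoevalcap
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# References and candidates keyed by image id: several human references
# per image, exactly one generated caption per image (toy data).
gts = {
    "0": ["a brown dog runs across the park", "a dog running on grass"],
    "1": ["two people ride bicycles down a street", "cyclists on a road"],
}
res = {
    "0": ["a dog runs through a park"],
    "1": ["two people riding bikes on a street"],
}

bleu, _ = Bleu(4).compute_score(gts, res)   # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider, _ = Cider().compute_score(gts, res)  # corpus-level CIDEr
print(bleu, cider)
```

Note that CIDEr's TF-IDF weighting is only meaningful over a sizable reference corpus, so a toy example like this yields scores far below the table's corpus-level numbers; METEOR, ROUGE-L, and SPICE have analogous scorer classes in the same package.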

Related Papers

- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)