Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

PaLI-3 Vision Language Models: Smaller, Faster, Stronger

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut

2023-10-13
Tasks: Cross-Modal Retrieval, Image Classification, Chart Question Answering, Temporal/Casual QA, Retrieval, Visual Question Answering (VQA), Language Modelling

Abstract

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.

Results

Task                            | Dataset        | Metric       | Value | Model
Question Answering              | NExT-QA        | WUPS         | 37.7  | PaLI-3
Visual Question Answering (VQA) | DocVQA test    | ANLS         | 0.886 | PaLI-3 (w/ OCR)
Visual Question Answering (VQA) | DocVQA test    | ANLS         | 0.876 | PaLI-3
Visual Question Answering (VQA) | InfographicVQA | ANLS         | 62.4  | PaLI-3 (w/ OCR)
Visual Question Answering (VQA) | InfographicVQA | ANLS         | 57.8  | PaLI-3
Visual Question Answering (VQA) | ChartQA        | 1:1 Accuracy | 70    | PaLI-3
Visual Question Answering (VQA) | ChartQA        | 1:1 Accuracy | 69.5  | PaLI-3 (w/ OCR)
Chart Question Answering        | ChartQA        | 1:1 Accuracy | 70    | PaLI-3
Chart Question Answering        | ChartQA        | 1:1 Accuracy | 69.5  | PaLI-3 (w/ OCR)
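
The DocVQA and InfographicVQA rows above report ANLS (Average Normalized Levenshtein Similarity). As a rough illustration of how such a score is commonly computed, here is a minimal Python sketch. It is not the official evaluation code for either benchmark: the function names `levenshtein` and `anls`, the lowercase/strip normalization, and the 0.5 threshold are assumptions based on the usual definition of the metric.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best per-answer similarity.

    Similarity is 1 - normalized edit distance, set to 0 when the
    normalized distance reaches the threshold (assumed to be 0.5).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()  # assumed normalization
        best = 0.0
        for answer in answers:
            gold = answer.strip().lower()
            longest = max(len(pred), len(gold))
            distance = levenshtein(pred, gold) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    # Hypothetical predictions and ground-truth answers, for illustration only.
    print(anls(["pariz", "london"], [["Paris"], ["Berlin"]]))
```

Under this definition the score lies in [0, 1]. The DocVQA values above appear to be on that scale, while the InfographicVQA values appear to be the same quantity multiplied by 100.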

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)