PaLI-X: On Scaling up a Multilingual Vision and Language Model

Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut

2023-05-29

Question Answering, Chart Question Answering, document understanding, Temporal/Casual QA, Fine-Grained Image Recognition, Video Question Answering, Video Captioning, Visual Question Answering (VQA), object-detection, Object Detection, Language Modelling

Abstract

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | NExT-QA | WUPS | 38.3 | PaLI-X |
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.868 | PaLI-X (Single-task FT w/ OCR) |
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.809 | PaLI-X (Multi-task FT) |
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.8 | PaLI-X (Single-task FT) |
| Visual Question Answering (VQA) | InfoSeek | Accuracy | 24 | PaLI-X |
| Visual Question Answering (VQA) | OK-VQA | Accuracy | 66.1 | PaLI-X (Single-task FT) |
| Visual Question Answering (VQA) | InfographicVQA | ANLS | 54.8 | PaLI-X (Single-task FT w/ OCR) |
| Visual Question Answering (VQA) | InfographicVQA | ANLS | 50.7 | PaLI-X (Multi-task FT) |
| Visual Question Answering (VQA) | InfographicVQA | ANLS | 49.2 | PaLI-X (Single-task FT) |
| Visual Question Answering (VQA) | ChartQA | 1:1 Accuracy | 72.3 | PaLI-X (Single-task FT w/ OCR) |
| Visual Question Answering (VQA) | ChartQA | 1:1 Accuracy | 70.9 | PaLI-X (Single-task FT) |
| Visual Question Answering (VQA) | ChartQA | 1:1 Accuracy | 70.6 | PaLI-X (Multi-task FT) |
| Image Recognition | OVEN | Accuracy | 23.1 | PaLI-X |
| Chart Question Answering | ChartQA | 1:1 Accuracy | 72.3 | PaLI-X (Single-task FT w/ OCR) |
| Chart Question Answering | ChartQA | 1:1 Accuracy | 70.9 | PaLI-X (Single-task FT) |
| Chart Question Answering | ChartQA | 1:1 Accuracy | 70.6 | PaLI-X (Multi-task FT) |
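
The DocVQA and InfographicVQA rows report ANLS (Average Normalized Levenshtein Similarity), which this page does not define. The following is a minimal sketch of the commonly used definition, not the evaluation code behind the numbers above. The function names `levenshtein` and `anls`, the lowercasing and whitespace stripping, and the 0.5 threshold are assumptions chosen for illustration; the actual evaluation may normalize answers differently.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any reference.

    `threshold` is an assumed default: a normalized distance at or above it
    scores zero, so answers that are mostly wrong earn no partial credit.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        p = prediction.strip().lower()  # assumed normalization
        best = 0.0
        for answer in answers:
            a = answer.strip().lower()
            longest = max(len(a), len(p))
            distance = levenshtein(a, p) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Calling `anls(["hallo"], [["hello"]])` compares one prediction against one accepted answer. The function returns a value in the 0 to 1 range, which matches the scale of the DocVQA rows; the InfographicVQA rows appear to use the same metric on a 0 to 100 scale.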
