
PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut

Published: 2022-09-14
Tasks: Question Answering, Image Classification, Zero-Shot Image Classification, Few-Shot Image Classification, Image Captioning, Visual Reasoning, Zero-Shot Transfer Image Classification, Visual Question Answering (VQA)
Links: Paper, PDF, Code (official)

Abstract

Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
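The abstract's key architectural point — reusing a pretrained ViT as the image encoder and feeding its output tokens, together with the embedded text prompt, into a pretrained encoder-decoder language model that generates the answer as text — can be illustrated with a minimal PyTorch sketch. This is a hypothetical stand-in, not the actual PaLI implementation: the class name `PaLIStyleModel`, all dimensions, and the use of `nn.Transformer` modules in place of ViT-e and an mT5-style language model are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PaLIStyleModel(nn.Module):
    """Toy sketch of the PaLI interface: image -> visual tokens -> text out."""

    def __init__(self, d_model=512, n_heads=8, n_layers=4, vocab_size=32000):
        super().__init__()
        # Stand-in for the pretrained ViT (PaLI scales this up to ViT-e, 4B params).
        self.patch_embed = nn.Linear(16 * 16 * 3, d_model)  # 16x16 RGB patches
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Stand-in for the pretrained encoder-decoder LM (mT5-style in PaLI).
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.encoder_decoder = nn.Transformer(
            d_model, n_heads, n_layers, n_layers, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, prompt_ids, target_ids):
        # 1) Encode image patches into visual tokens.
        vis = self.vision_encoder(self.patch_embed(patches))
        # 2) The "flexible task interface": visual tokens are concatenated
        #    with the embedded text prompt and consumed by the encoder.
        src = torch.cat([vis, self.text_embed(prompt_ids)], dim=1)
        # 3) The decoder produces the output as text (logits, teacher-forced here).
        tgt = self.text_embed(target_ids)
        return self.lm_head(self.encoder_decoder(src, tgt))

model = PaLIStyleModel()
patches = torch.randn(2, 196, 16 * 16 * 3)   # 2 images, 14x14 patch grid
prompt = torch.randint(0, 32000, (2, 12))    # e.g. a VQA question, any language
target = torch.randint(0, 32000, (2, 8))     # answer tokens for teacher forcing
logits = model(patches, prompt, target)      # -> (2, 8, 32000)
```

Because every task (captioning, VQA, classification) is cast as text-in/text-out through this single interface, the vision and language towers can be scaled independently without changing the task plumbing — which is exactly the joint-scaling question the paper studies.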

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Visual Question Answering (VQA) | TextVQA test-standard | Overall | 73.1 | PaLI |
| Visual Question Answering (VQA) | VizWiz 2020 VQA | Overall | 73.3 | PaLI |
| Visual Question Answering (VQA) | OK-VQA | Accuracy | 64.5 | PaLI 17B |
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 84.3 | PaLI |
| Image Captioning | nocaps near-domain | BLEU-1 | 88.57 | PaLI |
| Image Captioning | nocaps near-domain | BLEU-2 | 75.56 | PaLI |
| Image Captioning | nocaps near-domain | BLEU-3 | 58.99 | PaLI |
| Image Captioning | nocaps near-domain | BLEU-4 | 39.98 | PaLI |
| Image Captioning | nocaps near-domain | CIDEr | 124.35 | PaLI |
| Image Captioning | nocaps near-domain | METEOR | 33.47 | PaLI |
| Image Captioning | nocaps near-domain | ROUGE-L | 63.99 | PaLI |
| Image Captioning | nocaps near-domain | SPICE | 15.75 | PaLI |
| Image Captioning | nocaps out-of-domain | BLEU-1 | 86.28 | PaLI |
| Image Captioning | nocaps out-of-domain | BLEU-2 | 71.19 | PaLI |
| Image Captioning | nocaps out-of-domain | BLEU-3 | 52.63 | PaLI |
| Image Captioning | nocaps out-of-domain | BLEU-4 | 32 | PaLI |
| Image Captioning | nocaps out-of-domain | CIDEr | 126.67 | PaLI |
| Image Captioning | nocaps out-of-domain | METEOR | 30.99 | PaLI |
| Image Captioning | nocaps out-of-domain | ROUGE-L | 61.35 | PaLI |
| Image Captioning | nocaps out-of-domain | SPICE | 15.49 | PaLI |
| Image Captioning | nocaps in-domain | BLEU-1 | 88.02 | PaLI |
| Image Captioning | nocaps in-domain | BLEU-2 | 75.21 | PaLI |
| Image Captioning | nocaps in-domain | BLEU-3 | 59.38 | PaLI |
| Image Captioning | nocaps in-domain | BLEU-4 | 41.16 | PaLI |
| Image Captioning | nocaps in-domain | CIDEr | 121.09 | PaLI |
| Image Captioning | nocaps in-domain | CIDEr | 149.1 | PaLI |
| Image Captioning | nocaps in-domain | METEOR | 34.22 | PaLI |
| Image Captioning | nocaps in-domain | ROUGE-L | 64.39 | PaLI |
| Image Captioning | nocaps in-domain | SPICE | 15.69 | PaLI |
| Image Classification | ImageNet V2 | Top-1 Accuracy | 84.3 | ViT-e |
| Image Classification | ObjectNet | Top-1 Accuracy | 72 | ViT-e |
| Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 85.4 | LiT ViT-e |
| Zero-Shot Transfer Image Classification | ImageNet | Accuracy (Private) | 72.11 | PaLI |
| Zero-Shot Transfer Image Classification | ImageNet V2 | Accuracy (Private) | 80.6 | LiT ViT-e |
| Zero-Shot Transfer Image Classification | ImageNet V2 | Accuracy (Private) | 64.46 | PaLI |
| Zero-Shot Transfer Image Classification | ImageNet-A | Accuracy (Private) | 88 | LiT ViT-e |
| Zero-Shot Transfer Image Classification | ImageNet-A | Accuracy (Private) | 44.7 | PaLI |
| Zero-Shot Transfer Image Classification | ImageNet-R | Accuracy | 96.1 | LiT ViT-e |
| Zero-Shot Transfer Image Classification | ImageNet-R | Accuracy | 81.97 | PaLI |
| Zero-Shot Transfer Image Classification | ObjectNet | Accuracy (Private) | 84.9 | LiT ViT-e |
| Zero-Shot Transfer Image Classification | ObjectNet | Accuracy (Private) | 42.62 | PaLI |
| Zero-Shot Transfer Image Classification | ObjectNet | Top-5 Accuracy | 58.35 | PaLI |
| Zero-Shot Transfer Image Classification | ImageNet-S | Accuracy (Private) | 63.83 | PaLI |
| Zero-Shot Transfer Image Classification | ImageNet-S | Top-5 Accuracy | 79.3 | PaLI |
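The captioning rows above use the standard COCO-style metrics (BLEU-n, CIDEr, METEOR, ROUGE-L, SPICE). As a reference point, here is a minimal sketch of how such scores are computed with the pycocoevalcap toolkit, the usual open-source implementation of these metrics; whether PaLI's reported numbers were produced with exactly this tooling is an assumption, and the toy captions below are invented for illustration:

```python
# pip install pycocoevalcap
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# References and candidates keyed by image id: several human references
# per image, exactly one generated caption per image (toy data).
gts = {
    "0": ["a brown dog runs across the park", "a dog running on grass"],
    "1": ["two people ride bicycles down a street", "cyclists on a road"],
}
res = {
    "0": ["a dog runs through a park"],
    "1": ["two people riding bikes on a street"],
}

bleu, _ = Bleu(4).compute_score(gts, res)   # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider, _ = Cider().compute_score(gts, res)  # corpus-level CIDEr
print(bleu, cider)
```

Note that CIDEr's TF-IDF weighting is only meaningful over a sizable reference corpus, so a toy example like this yields scores far below the table's corpus-level numbers; METEOR, ROUGE-L, and SPICE have analogous scorer classes in the same package.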

Related Papers

- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)