Prismer: A Vision-Language Model with Multi-Task Experts

Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar

2023-03-04

Few-Shot Learning, Image Captioning, Visual Question Answering (VQA), Language Modelling

Abstract

Recent vision-language models have shown impressive multi-modal generation capabilities. However, they typically require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of task-specific experts. Prismer requires training only a small number of components; the majority of its network weights are inherited from multiple readily available, pre-trained experts and kept frozen during training. By leveraging experts from a wide range of domains, we show that Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, Prismer achieves fine-tuned and few-shot learning performance that is competitive with the current state of the art, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer.

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	VQA v2 test-dev	Accuracy	78.43	Prismer
Visual Question Answering (VQA)	VQA v2 test-std	number	61.39	Prismer
Visual Question Answering (VQA)	VQA v2 test-std	other	69.7	Prismer
Visual Question Answering (VQA)	VQA v2 test-std	overall	78.49	Prismer
Visual Question Answering (VQA)	VQA v2 test-std	yes/no	93.09	Prismer
Image Captioning	nocaps entire	B1	84.87	Prismer
Image Captioning	nocaps entire	B2	69.99	Prismer
Image Captioning	nocaps entire	B3	52.48	Prismer
Image Captioning	nocaps entire	B4	33.66	Prismer
Image Captioning	nocaps entire	CIDEr	110.84	Prismer
Image Captioning	nocaps entire	METEOR	31.13	Prismer
Image Captioning	nocaps entire	ROUGE-L	60.55	Prismer
Image Captioning	nocaps entire	SPICE	14.91	Prismer
Image Captioning	COCO Captions	BLEU-4	40.4	Prismer
Image Captioning	COCO Captions	CIDEr	136.5	Prismer
Image Captioning	COCO Captions	METEOR	31.4	Prismer
Image Captioning	COCO Captions	SPICE	24.4	Prismer
Image Captioning	nocaps val	CIDEr	107.9	Prismer
Image Captioning	nocaps val	SPICE	14.8	Prismer
