Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar
Recent vision-language models have shown impressive multi-modal generation capabilities. However, they typically require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of task-specific experts. Prismer requires training only a small number of components; the majority of its network weights are inherited from multiple readily available, pre-trained experts and kept frozen during training. By leveraging experts from a wide range of domains, we show that Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance that is competitive with the current state of the art, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer.
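The idea of freezing inherited expert weights and training only a few new components can be illustrated with a minimal, framework-free sketch. Everything in it is an assumption made for illustration: the component names (`expert_1`, `trainable_connector`, and so on), the `Component` class, and the parameter counts are hypothetical placeholders, not values or identifiers from Prismer or its code release.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    num_params: int   # placeholder count, NOT taken from the paper
    trainable: bool


def freeze_experts(components, expert_names):
    """Return new components in which every named expert is marked frozen."""
    return [
        Component(c.name, c.num_params, c.trainable and c.name not in expert_names)
        for c in components
    ]


def trainable_fraction(components):
    """Share of all parameters that would still receive gradient updates."""
    total = sum(c.num_params for c in components)
    trainable = sum(c.num_params for c in components if c.trainable)
    return trainable / total


# Hypothetical model: three inherited experts plus one small new module.
components = [
    Component("expert_1", 300, True),
    Component("expert_2", 300, True),
    Component("expert_3", 300, True),
    Component("trainable_connector", 100, True),
]

frozen = freeze_experts(components, {"expert_1", "expert_2", "expert_3"})
fraction = trainable_fraction(frozen)
print(f"trainable share of parameters: {fraction:.0%}")
```

With these placeholder counts, only the connector remains trainable, so a small share of the total parameters is updated during training. In a real deep-learning framework the same effect is usually achieved by disabling gradients on the inherited modules.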
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 78.43 | Prismer |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (number) | 61.39 | Prismer |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (other) | 69.7 | Prismer |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (overall) | 78.49 | Prismer |
| Visual Question Answering (VQA) | VQA v2 test-std | Accuracy (yes/no) | 93.09 | Prismer |
| Image Captioning | nocaps entire | BLEU-1 | 84.87 | Prismer |
| Image Captioning | nocaps entire | BLEU-2 | 69.99 | Prismer |
| Image Captioning | nocaps entire | BLEU-3 | 52.48 | Prismer |
| Image Captioning | nocaps entire | BLEU-4 | 33.66 | Prismer |
| Image Captioning | nocaps entire | CIDEr | 110.84 | Prismer |
| Image Captioning | nocaps entire | METEOR | 31.13 | Prismer |
| Image Captioning | nocaps entire | ROUGE-L | 60.55 | Prismer |
| Image Captioning | nocaps entire | SPICE | 14.91 | Prismer |
| Image Captioning | COCO Captions | BLEU-4 | 40.4 | Prismer |
| Image Captioning | COCO Captions | CIDEr | 136.5 | Prismer |
| Image Captioning | COCO Captions | METEOR | 31.4 | Prismer |
| Image Captioning | COCO Captions | SPICE | 24.4 | Prismer |
| Image Captioning | nocaps val | CIDEr | 107.9 | Prismer |
| Image Captioning | nocaps val | SPICE | 14.8 | Prismer |