Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Prismer: A Vision-Language Model with Multi-Task Experts

Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar

Published: 2023-03-04
Tasks: Few-Shot Learning, Image Captioning, Visual Question Answering (VQA), Language Modelling

Abstract

Recent vision-language models have shown impressive multi-modal generation capabilities. However, typically they require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of task-specific experts. Prismer only requires training of a small number of components, with the majority of network weights inherited from multiple readily-available, pre-trained experts, and kept frozen during training. By leveraging experts from a wide range of domains, we show Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance which is competitive with current state-of-the-arts, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer.
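As a hedged illustration of the pattern the abstract describes (an ensemble of frozen pre-trained experts pooled by a small trainable component), the sketch below uses toy NumPy "experts" standing in for pre-trained models. All class names, shapes, and the plain SGD loop are illustrative assumptions for this sketch, not Prismer's actual architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

class FrozenExpert:
    """Stand-in for a pre-trained, frozen expert: a fixed random projection."""
    def __init__(self, in_dim, out_dim):
        self.W = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

    def __call__(self, x):
        return x @ self.W  # weights are never updated


class FusionHead:
    """Small trainable component: the only weights that change during training."""
    def __init__(self, in_dim, out_dim):
        self.W = np.zeros((in_dim, out_dim))

    def __call__(self, feats):
        return feats @ self.W

    def sgd_step(self, feats, target, lr=0.1):
        # Gradient step on mean-squared error; only self.W is updated.
        grad = feats.T @ (self(feats) - target) / len(feats)
        self.W -= lr * grad


# Three frozen experts (e.g. different vision tasks), one trainable head.
experts = [FrozenExpert(16, 8) for _ in range(3)]
head = FusionHead(3 * 8, 4)

x = rng.standard_normal((32, 16))        # a batch of inputs
target = rng.standard_normal((32, 4))    # toy regression targets

# Pool the expert features by concatenation, then train only the head.
pooled = np.concatenate([e(x) for e in experts], axis=1)
before = np.mean((head(pooled) - target) ** 2)
for _ in range(50):
    head.sgd_step(pooled, target)
after = np.mean((head(pooled) - target) ** 2)
# The loss drops while every expert's weights remain exactly as initialized.
```

The design point is the parameter budget: only the small fusion head receives gradients, which is why this style of model can be trained with far less data and compute than end-to-end training of a comparably capable network.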

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 78.43 | Prismer
Visual Question Answering (VQA) | VQA v2 test-std | number | 61.39 | Prismer
Visual Question Answering (VQA) | VQA v2 test-std | other | 69.7 | Prismer
Visual Question Answering (VQA) | VQA v2 test-std | overall | 78.49 | Prismer
Visual Question Answering (VQA) | VQA v2 test-std | yes/no | 93.09 | Prismer
Image Captioning | nocaps entire | B1 | 84.87 | Prismer
Image Captioning | nocaps entire | B2 | 69.99 | Prismer
Image Captioning | nocaps entire | B3 | 52.48 | Prismer
Image Captioning | nocaps entire | B4 | 33.66 | Prismer
Image Captioning | nocaps entire | CIDEr | 110.84 | Prismer
Image Captioning | nocaps entire | METEOR | 31.13 | Prismer
Image Captioning | nocaps entire | ROUGE-L | 60.55 | Prismer
Image Captioning | nocaps entire | SPICE | 14.91 | Prismer
Image Captioning | COCO Captions | BLEU-4 | 40.4 | Prismer
Image Captioning | COCO Captions | CIDEr | 136.5 | Prismer
Image Captioning | COCO Captions | METEOR | 31.4 | Prismer
Image Captioning | COCO Captions | SPICE | 24.4 | Prismer
Image Captioning | nocaps val | CIDEr | 107.9 | Prismer
Image Captioning | nocaps val | SPICE | 14.8 | Prismer

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)