Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, Radu Soricut
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture of Experts (MoE) architectures are useful for instruction tuning, but for LMMs of parameter size around O(50-100B), the prohibitive cost of replicating and storing the expert models severely limits the number of experts we can use. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low-rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition here is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive experiments demonstrate that the SMoLA approach helps improve the generalist performance across a broad range of generative vision-and-language tasks, achieving new SoTA generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.
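The core idea of softly mixing low-rank experts on top of a frozen backbone can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function and variable names (`smola_layer`, `router`) are illustrative, and the per-token softmax router over N rank-r experts is an assumption about how the Soft-MoE-style mixing could be realized.

```python
import numpy as np

def smola_layer(x, W, A, B, router):
    """Softly mix N low-rank experts residually on top of a frozen dense layer.

    x:      (d_in,)         token embedding
    W:      (d_out, d_in)   frozen backbone weight
    A:      (N, r, d_in)    per-expert low-rank down-projections
    B:      (N, d_out, r)   per-expert low-rank up-projections
    router: (N, d_in)       per-expert routing vectors (hypothetical router)
    """
    # Soft routing: a softmax over expert logits, so every expert
    # contributes with a continuous weight (no hard top-k selection).
    logits = router @ x                      # (N,)
    p = np.exp(logits - logits.max())
    p /= p.sum()                             # softmax expert weights

    base = W @ x                             # frozen backbone output
    # Residual term: weighted sum of cheap rank-r expert outputs.
    delta = sum(p[k] * (B[k] @ (A[k] @ x)) for k in range(len(p)))
    return base + delta
```

Because each expert is only a rank-r adapter (two small matrices) rather than a full replica of the layer, adding many experts changes the parameter count very little, which matches the abstract's claim about avoiding a significant number of new parameters.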
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | AI2D | EM | 82.5 | SMoLA-PaLI-X Specialist |
| Visual Question Answering (VQA) | AI2D | EM | 81.4 | SMoLA-PaLI-X Generalist |
| Visual Question Answering (VQA) | A-OKVQA | Direct-Answer VQA Score | 70.55 | SMoLA-PaLI-X Specialist |
| Visual Question Answering (VQA) | A-OKVQA | Multiple-Choice Accuracy | 83.75 | SMoLA-PaLI-X Specialist |
| Visual Question Answering (VQA) | DocVQA (test) | ANLS | 90.8 | SMoLA-PaLI-X Specialist |
| Visual Question Answering (VQA) | DocVQA (test) | ANLS | 90.6 | SMoLA-PaLI-X Generalist |
| Visual Question Answering (VQA) | InfographicVQA | ANLS | 66.2 | SMoLA-PaLI-X Specialist |
| Visual Question Answering (VQA) | InfographicVQA | ANLS | 65.6 | SMoLA-PaLI-X Generalist |
| Visual Question Answering (VQA) | ChartQA | Relaxed Accuracy | 74.6 | SMoLA-PaLI-X Specialist |
| Visual Question Answering (VQA) | ChartQA | Relaxed Accuracy | 73.8 | SMoLA-PaLI-X Generalist |
| Object Counting | TallyQA-Complex | Accuracy | 77.1 | SMoLA-PaLI-X Specialist |
| Object Counting | TallyQA-Complex | Accuracy | 70.7 | SMoLA-PaLI-X Generalist (0-shot) |
| Object Counting | TallyQA-Simple | Accuracy | 86.3 | SMoLA-PaLI-X Specialist |
| Object Counting | TallyQA-Simple | Accuracy | 83.3 | SMoLA-PaLI-X Generalist (0-shot) |