Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts

Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, Radu Soricut

2023-12-01 · CVPR 2024
Tasks: Chart Question Answering · Image Captioning · Document AI · Object Counting · Visual Question Answering (VQA)
Paper · PDF

Abstract

Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture of Experts (MoE) architectures are useful for instruction tuning, but for LMMs of parameter size around O(50-100B), the prohibitive cost of replicating and storing the expert models severely limits the number of experts we can use. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low-rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition here is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive experiments demonstrate that the SMoLA approach helps improve the generalist performance across a broad range of generative vision-and-language tasks, achieving new SoTA generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.
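The core idea of the abstract — a frozen backbone whose output is augmented by a softly weighted sum of low-rank residual experts — can be sketched for a single linear layer. This is a hypothetical illustration, not the paper's implementation: the class name `SMoLALinear`, the per-token softmax router, and all dimensions are assumptions; each expert adds only `d_in*r + r*d_out` parameters, which is why many experts stay cheap.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SMoLALinear:
    """Hypothetical sketch: frozen base linear layer plus a soft mixture
    of low-rank (LoRA-style) residual experts, in the spirit of Omni-SMoLA."""

    def __init__(self, d_in, d_out, n_experts=4, rank=2):
        # Frozen backbone weight ("the large model provides a foundational backbone").
        self.W = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)
        # Per-expert low-rank factors; B starts at zero, so the residual
        # path contributes nothing until the experts are trained.
        self.A = rng.normal(size=(n_experts, d_in, rank)) * 0.01
        self.B = np.zeros((n_experts, rank, d_out))
        # Per-token routing logits for the soft mixture weights.
        self.router = rng.normal(size=(d_in, n_experts)) * 0.01

    def __call__(self, x):
        base = x @ self.W                        # (tokens, d_out) frozen path
        gates = softmax(x @ self.router)         # (tokens, E) soft mixture weights
        # residual[t] = sum_e gates[t, e] * (x[t] @ A_e @ B_e)
        expert_out = np.einsum('ti,eir,erd->ted', x, self.A, self.B)
        residual = np.einsum('te,ted->td', gates, expert_out)
        return base + residual

layer = SMoLALinear(d_in=8, d_out=8)
y = layer(rng.normal(size=(5, 8)))
print(y.shape)  # (5, 8)
```

With zero-initialized up-projections, the layer initially reproduces the frozen backbone exactly; training the `A`/`B` factors and the router lets each expert residually specialize while the soft gates blend them per token.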

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | AI2D | EM | 82.5 | SMoLA-PaLI-X Specialist Model |
| Visual Question Answering (VQA) | AI2D | EM | 81.4 | SMoLA-PaLI-X Generalist Model |
| Visual Question Answering (VQA) | A-OKVQA | DA VQA Score | 70.55 | SMoLA-PaLI-X Specialist Model |
| Visual Question Answering (VQA) | A-OKVQA | MC Accuracy | 83.75 | SMoLA-PaLI-X Specialist Model |
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.908 | SMoLA-PaLI-X Specialist |
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.906 | SMoLA-PaLI-X Generalist |
| Visual Question Answering (VQA) | InfographicVQA | ANLS | 66.2 | SMoLA-PaLI-X Specialist |
| Visual Question Answering (VQA) | InfographicVQA | ANLS | 65.6 | SMoLA-PaLI-X Generalist |
| Visual Question Answering (VQA) | ChartQA | 1:1 Accuracy | 74.6 | SMoLA-PaLI-X Specialist Model |
| Visual Question Answering (VQA) | ChartQA | 1:1 Accuracy | 73.8 | SMoLA-PaLI-X Generalist Model |
| Object Counting | TallyQA-Complex | Accuracy | 77.1 | SMoLA-PaLI-X Specialist |
| Object Counting | TallyQA-Complex | Accuracy | 70.7 | SMoLA-PaLI-X Generalist (0 shot) |
| Object Counting | TallyQA-Simple | Accuracy | 86.3 | SMoLA-PaLI-X Specialist |
| Object Counting | TallyQA-Simple | Accuracy | 83.3 | SMoLA-PaLI-X Generalist (0 shot) |
| Chart Question Answering | ChartQA | 1:1 Accuracy | 74.6 | SMoLA-PaLI-X Specialist Model |
| Chart Question Answering | ChartQA | 1:1 Accuracy | 73.8 | SMoLA-PaLI-X Generalist Model |

Related Papers

- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Car Object Counting and Position Estimation via Extension of the CLIP-EBC Framework (2025-07-11)
- Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
- LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
- Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)