
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao

2023-11-13 · Question Answering · Described Object Detection · Pose Estimation · Large Language Model · Visual Question Answering (VQA) · Language Modelling
Paper · PDF · Code (official)

Abstract

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training and introduce a weight-mixing strategy between LLMs trained on real-world and synthetic data. By directly integrating the weights from the two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning and design task-specific instructions to avoid inter-task conflict. In addition to basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement across different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularities, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy to better capture fine-grained appearances of high-resolution images. By mixing different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may shed light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
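
The weight-mixing step described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical Python/PyTorch example and not the released SPHINX code: it assumes two checkpoints of the same LLM architecture, saved at placeholder paths `llm_real.pt` and `llm_synth.pt`, and blends them parameter-wise with an illustrative mixing ratio `beta`.

```python
# Minimal sketch of domain weight mixing: linearly interpolate the parameters
# of two LLMs fine-tuned on different data domains. Paths, key names, and the
# value of `beta` are placeholders, not values from the paper.
import torch


def mix_weights(state_dict_a, state_dict_b, beta=0.5):
    """Interpolate two state dicts that share the same architecture and keys."""
    mixed = {}
    for name, param_a in state_dict_a.items():
        param_b = state_dict_b[name]
        if torch.is_floating_point(param_a):
            # Parameter-wise linear interpolation between the two domains.
            mixed[name] = beta * param_a + (1.0 - beta) * param_b
        else:
            # Copy non-float buffers (e.g. integer position ids) unchanged.
            mixed[name] = param_a.clone()
    return mixed


if __name__ == "__main__":
    # Hypothetical checkpoints of one LLM trained on real-world vs. synthetic data.
    real = torch.load("llm_real.pt", map_location="cpu")
    synth = torch.load("llm_synth.pt", map_location="cpu")
    torch.save(mix_weights(real, synth, beta=0.5), "llm_mixed.pt")
```

Because the two checkpoints share one architecture, the interpolation requires no additional training; the abstract frames this direct integration as a way to combine the semantics of the real-world and synthetic domains while retaining robustness.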

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 49.85 | SPHINX v2
Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 20.69 | SPHINX v2
Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 42.17 | SPHINX v2
Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 39.48 | SPHINX v2
Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 40.2 | SPHINX-2k
Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 57.43 | Sphinx-V2-1K
Object Detection | Description Detection Dataset | Intra-scenario ABS mAP | 7.9 | SPHINX-7B
Object Detection | Description Detection Dataset | Intra-scenario FULL mAP | 10.6 | SPHINX-7B
Object Detection | Description Detection Dataset | Intra-scenario PRES mAP | 11.4 | SPHINX-7B

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)