MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

Yue Cao, Yangzhou Liu, Zhe Chen, Guangchen Shi, Wenhai Wang, Danhuai Zhao, Tong Lu

2024-10-15Visual Question Answering

Abstract

Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods integrating multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose \modelname, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the representation with fine-grained information. Applied to the LLaVA-1.5 model, \modelname~achieves significant improvements in visual representation and benchmark performance, providing a more flexible and lightweight solution compared to multi-encoder ensemble methods. The code and model have been released at https://github.com/yuecao0119/MMFuser.

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	MM-Vet	GPT-4 score	36.6	LLaVA-1.5-13B (+ MMFuser)
Visual Question Answering	MM-Vet	GPT-4 score	36.6	LLaVA-1.5-13B (+ MMFuser)

Related Papers

Describe Anything Model for Visual Question Answering on Text-rich Images2025-07-16 Evaluating Attribute Confusion in Fashion Text-to-Image Generation2025-07-09 LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation2025-07-09 Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights2025-07-09 MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning2025-07-09 Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling2025-07-08 ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding2025-07-07 Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models2025-06-28