Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Honeybee: Locality-enhanced Projector for Multimodal LLM

Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh

2023-12-11 · CVPR 2024 · Science Question Answering
Paper · PDF · Code (official)

Abstract

In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling profound visual understanding while harnessing the LLMs' robust capabilities. Despite the importance of the visual projector, it has been relatively less explored. In this study, we first identify two essential projector properties: (i) flexibility in managing the number of visual tokens, crucial for MLLMs' overall efficiency, and (ii) preservation of local context from visual features, vital for spatial understanding. Based on these findings, we propose a novel projector design that is both flexible and locality-enhanced, effectively satisfying the two desirable properties. Additionally, we present comprehensive strategies to effectively utilize multiple and multifaceted instruction datasets. Through extensive experiments, we examine the impact of individual design choices. Finally, our proposed MLLM, Honeybee, remarkably outperforms previous state-of-the-art methods across various benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench, achieving significantly higher efficiency. Code and models are available at https://github.com/kakaobrain/honeybee.
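The abstract's two projector properties can be illustrated with a small sketch: reducing an H×W grid of visual tokens to a smaller grid by averaging spatially adjacent features gives flexibility in the output token count while keeping each output token tied to a local image region. This is only an illustrative toy, not the authors' actual projector (Honeybee's design is described in the paper and repository); the function name, shapes, and the divisibility assumption are all hypothetical.

```python
# Illustrative sketch only: locality-preserving visual-token reduction via
# 2D average pooling. NOT the Honeybee projector itself; names and the
# "grid divisible by out_grid" assumption are this sketch's, not the paper's.

def pool_visual_tokens(features, grid, out_grid):
    """Average-pool a grid x grid map of feature vectors down to
    out_grid x out_grid, so the number of output tokens is configurable
    while each output token summarizes one contiguous spatial block.

    features: list of grid*grid vectors (row-major), each a list of floats.
    Returns a list of out_grid*out_grid pooled vectors (row-major).
    """
    assert grid % out_grid == 0, "sketch assumes evenly divisible grids"
    k = grid // out_grid          # pooling window side length
    dim = len(features[0])
    pooled = []
    for oy in range(out_grid):
        for ox in range(out_grid):
            acc = [0.0] * dim
            for dy in range(k):
                for dx in range(k):
                    # Row-major index into the input grid for this window cell.
                    vec = features[(oy * k + dy) * grid + (ox * k + dx)]
                    for i, v in enumerate(vec):
                        acc[i] += v
            pooled.append([a / (k * k) for a in acc])
    return pooled
```

For example, pooling a 24×24 feature map to 12×12 cuts 576 visual tokens to 144; choosing a different output grid trades compute for spatial detail, which is the flexibility the abstract refers to.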

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | ScienceQA | Avg. Accuracy | 94.39 | Honeybee |
| Question Answering | ScienceQA | Grades 1-6 | 95.04 | Honeybee |
| Question Answering | ScienceQA | Grades 7-12 | 93.21 | Honeybee |
| Question Answering | ScienceQA | Image Context | 93.75 | Honeybee |
| Question Answering | ScienceQA | Language Science | 91.18 | Honeybee |
| Question Answering | ScienceQA | Natural Science | 95.2 | Honeybee |
| Question Answering | ScienceQA | No Context | 93.17 | Honeybee |
| Question Answering | ScienceQA | Social Science | 96.29 | Honeybee |
| Question Answering | ScienceQA | Text Context | 94.48 | Honeybee |

Related Papers

- BioRAG: A RAG-LLM Framework for Biological Question Reasoning (2024-08-02)
- SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation (2024-05-16)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (2024-02-05)
- Automated Answer Validation using Text Similarity (2024-01-13)
- Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training (2023-11-23)
- ATLANTIC: Structure-Aware Retrieval-Augmented Language Model for Interdisciplinary Science (2023-11-21)
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding (2023-11-14)
- A Survey on Interpretable Cross-modal Reasoning (2023-09-05)