Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen

2023-12-28 · Image Classification · Referring Expression Generation · AutoML · Referring Expression Comprehension · Language Modelling

Paper · PDF · Code (official)

Abstract

We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of a myriad of mobile-oriented architectural designs and techniques, comprising a set of language models at the scale of 1.4B and 2.7B parameters trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on-par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, obtaining state-of-the-art throughput of 21.5 and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
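The efficiency of a pipeline like the one described above hinges on how many visual tokens the projector passes to the language model. As an illustrative sketch (not the paper's implementation), assuming a CLIP ViT-L/14-style encoder at 336×336 input and a projector that spatially downsamples the patch grid 2×2, the token-count arithmetic looks like this:

```python
import math

# Illustrative token-count arithmetic for a MobileVLM-style pipeline.
# Assumptions (not stated in the abstract): a ViT encoder with 14x14
# patches at 336x336 input, and a projector with a 2x2 spatial
# downsample before visual tokens reach the language model.

def vision_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens produced by a ViT-style encoder."""
    grid = image_size // patch_size
    return grid * grid

def projected_tokens(n_tokens: int, stride: int) -> int:
    """Tokens remaining after a stride x stride spatial downsample."""
    grid = math.isqrt(n_tokens)
    assert grid * grid == n_tokens, "expects a square token grid"
    out = math.ceil(grid / stride)
    return out * out

raw = vision_tokens(336, 14)     # 24x24 grid -> 576 tokens
kept = projected_tokens(raw, 2)  # 12x12 grid -> 144 tokens
print(raw, kept)                 # -> 576 144
```

Under these assumptions the projector cuts the visual sequence by 75%, which directly shortens the language model's prefill and helps explain why an "efficient projector" matters for on-device token throughput.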

Results

| Task                          | Dataset               | Metric   | Value | Model                                    |
|-------------------------------|-----------------------|----------|-------|------------------------------------------|
| Image Classification          | ColonINST-v1 (Seen)   | Accuracy | 93.64 | MobileVLM-1.7B (w/ LoRA, w/ extra data)  |
| Image Classification          | ColonINST-v1 (Seen)   | Accuracy | 93.02 | MobileVLM-1.7B (w/o LoRA, w/ extra data) |
| Image Classification          | ColonINST-v1 (Unseen) | Accuracy | 80.44 | MobileVLM-1.7B (w/ LoRA, w/ extra data)  |
| Image Classification          | ColonINST-v1 (Unseen) | Accuracy | 78.75 | MobileVLM-1.7B (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 78.03 | MobileVLM-1.7B (w/ LoRA, w/ extra data)  |
| Referring Expression Generation | ColonINST-v1 (Unseen) | Accuracy | 73.14 | MobileVLM-1.7B (w/o LoRA, w/ extra data) |
| Referring Expression Generation | ColonINST-v1 (Seen)   | Accuracy | 97.87 | MobileVLM-1.7B (w/ LoRA, w/ extra data)  |
| Referring Expression Generation | ColonINST-v1 (Seen)   | Accuracy | 97.78 | MobileVLM-1.7B (w/o LoRA, w/ extra data) |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)