Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen
We present MobileVLM, a competent multimodal vision language model (MMVLM) designed to run on mobile devices. It combines a range of mobile-oriented architectural designs and techniques: a set of language models at the scale of 1.4B and 2.7B parameters trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks, where our models perform on par with a few much larger models. More importantly, we measure inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, obtaining state-of-the-art speeds of 21.5 and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
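The reported throughput numbers are decoding rates, i.e. generated tokens divided by wall-clock time. A minimal sketch of such a measurement is below; `tokens_per_second` and the `fake_decoder` stand-in are hypothetical illustrations, not the paper's actual benchmark harness.

```python
import time

def tokens_per_second(generate_fn, prompt, num_tokens):
    """Measure decoding throughput: tokens generated per wall-clock second.

    `generate_fn(prompt, num_tokens)` is any callable that decodes
    `num_tokens` tokens for `prompt`; we simply time the call.
    """
    start = time.perf_counter()
    generate_fn(prompt, num_tokens)
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

def fake_decoder(prompt, num_tokens):
    # Stand-in for a real model: pretend each token takes ~5 ms to decode.
    time.sleep(0.005 * num_tokens)

rate = tokens_per_second(fake_decoder, "a photo of", 20)
print(f"{rate:.1f} tokens/s")
```

In practice one would warm up the model first and average over several runs, since the first decode pass typically includes one-time setup cost.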
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 93.64 | MobileVLM-1.7B (w/ LoRA, w/ extra data) |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 93.02 | MobileVLM-1.7B (w/o LoRA, w/ extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 80.44 | MobileVLM-1.7B (w/ LoRA, w/ extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 78.75 | MobileVLM-1.7B (w/o LoRA, w/ extra data) |
| Referring expression generation | ColonINST-v1 (Seen) | Accuracy | 97.87 | MobileVLM-1.7B (w/ LoRA, w/ extra data) |
| Referring expression generation | ColonINST-v1 (Seen) | Accuracy | 97.78 | MobileVLM-1.7B (w/o LoRA, w/ extra data) |
| Referring expression generation | ColonINST-v1 (Unseen) | Accuracy | 78.03 | MobileVLM-1.7B (w/ LoRA, w/ extra data) |
| Referring expression generation | ColonINST-v1 (Unseen) | Accuracy | 73.14 | MobileVLM-1.7B (w/o LoRA, w/ extra data) |