


MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang

Published: 2024-09-30
Tasks: Video Understanding, Visual Question Answering, Optical Character Recognition (OCR)

Abstract

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.
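
To make the abstract's data-centric framing concrete, the sketch below expresses a training-data mixture as weighted sampling over data categories. It is purely illustrative: the category names and weights are invented for this example and are not the paper's reported mixtures or ratios.

```python
# Hypothetical sketch of a data-centric mixture; category names and weights
# are illustrative only and are NOT values reported in the MM1.5 paper.
import random

# Continual pre-training mixture: text-rich OCR data plus synthetic captions,
# as the abstract describes (weights made up for illustration).
continual_pretrain_mixture = {
    "ocr_documents": 0.45,
    "synthetic_captions": 0.35,
    "interleaved_image_text": 0.20,
}

# Supervised fine-tuning mixture: a blend of visual instruction data covering
# the capability groups the paper targets (again, illustrative weights).
sft_mixture = {
    "text_rich_qa": 0.30,
    "referring_and_grounding": 0.25,
    "multi_image_reasoning": 0.20,
    "general_instructions": 0.25,
}

def sample_source(mixture: dict[str, float], rng: random.Random) -> str:
    """Pick a data category according to its mixture weight."""
    names, weights = zip(*mixture.items())
    return rng.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    # Draw a few samples to see which categories a batch would be built from.
    print([sample_source(sft_mixture, rng) for _ in range(5)])
```

Varying such weights per training stage is one way to picture the "diverse data mixtures across the entire model training lifecycle" that the paper ablates.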

Results

Task                            | Dataset | Metric      | Value | Model
Visual Question Answering (VQA) | MM-Vet  | GPT-4 score | 52    | MM1.5-30B
Visual Question Answering (VQA) | MM-Vet  | GPT-4 score | 43.7  | MM1.5-3B-MoE
Visual Question Answering (VQA) | MM-Vet  | GPT-4 score | 42.2  | MM1.5-7B
Visual Question Answering (VQA) | MM-Vet  | GPT-4 score | 41    | MM1.5-3B
Visual Question Answering (VQA) | MM-Vet  | GPT-4 score | 39.8  | MM1.5-1B-MoE
Visual Question Answering (VQA) | MM-Vet  | GPT-4 score | 37.4  | MM1.5-1B
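
For a quick side-by-side view of the reported results, the short Python sketch below sorts the MM-Vet GPT-4 scores from the table above by value. The numbers are copied verbatim from the rows; the snippet is only a convenience for inspecting the size-vs-score trend.

```python
# MM-Vet GPT-4 scores as reported in the results table above.
mm_vet_scores = {
    "MM1.5-1B": 37.4,
    "MM1.5-1B-MoE": 39.8,
    "MM1.5-3B": 41.0,
    "MM1.5-7B": 42.2,
    "MM1.5-3B-MoE": 43.7,
    "MM1.5-30B": 52.0,
}

# Print models from lowest to highest score to show the scaling trend.
for model, score in sorted(mm_vet_scores.items(), key=lambda kv: kv[1]):
    print(f"{model:<14} {score:>5.1f}")
```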

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
DeQA-Doc: Adapting DeQA-Score to Document Image Quality Assessment (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)