H2OVL-Mississippi Vision Language Models Technical Report

Shaikat Galib, Shanshan Wang, Guanshuo Xu, Pascal Pfeiffer, Ryan Chesler, Mark Landry, Sri Satish Ambati

2024-10-17 | Document AI | Visual Question Answering

Abstract

Smaller vision-language models (VLMs) are becoming increasingly important for privacy-focused, on-device applications because they can run efficiently on consumer hardware when processing enterprise commercial documents and images. These models require strong language understanding and visual capabilities to enhance human-machine interaction. To address this need, we present H2OVL-Mississippi, a pair of small VLMs trained on 37 million image-text pairs using 240 hours of compute on 8 x H100 GPUs. H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition, achieving state-of-the-art performance on the Text Recognition portion of OCRBench and surpassing much larger models in this area. We also release H2OVL-Mississippi-2B, a 2-billion-parameter model for general use cases that shows highly competitive results across various academic benchmarks. Both models build upon our prior work with the H2O-Danube language models, extending their capabilities into the visual domain. Both are released under the Apache 2.0 license to make VLMs accessible to everyone and to democratize document AI and visual LLMs.

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	MM-Vet	GPT-4 score	44.7	H2OVL-Mississippi-2B
Visual Question Answering (VQA)	MM-Vet	GPT-4 score	30	H2OVL-Mississippi-0.8B
