Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data samples, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be publicly available.
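The "MLP projection" mentioned above replaces the single linear layer used as LLaVA's original vision-language connector with a small multi-layer perceptron that maps CLIP patch features into the LLM's embedding space. The sketch below is a minimal illustration of such a connector, not the authors' released code; the specific dimensions (1024 for CLIP-ViT-L patch features, 5120 for a 13B LLaMA-style model) and the GELU activation are our assumptions for concreteness.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP connector mapping vision features to the LLM embedding space.

    Dimensions are illustrative assumptions: 1024 for CLIP-ViT-L/14 patch
    features and 5120 for a 13B LLaMA-style hidden size.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim) from the CLIP encoder
        return self.proj(vision_features)


# A 336px image with 14px patches yields a 24x24 grid, i.e. 576 patch tokens.
features = torch.randn(1, 576, 1024)
tokens = MLPProjector()(features)  # (1, 576, 5120), ready to prepend to text embeddings
```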
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | AutoHallusion | Overall Accuracy | 44.5 | LLaVA-1.5 |
| Visual Question Answering (VQA) | InfiMM-Eval | Abductive | 47.91 | LLaVA-1.5 |
| Visual Question Answering (VQA) | InfiMM-Eval | Analogical | 24.31 | LLaVA-1.5 |
| Visual Question Answering (VQA) | InfiMM-Eval | Deductive | 30.94 | LLaVA-1.5 |
| Visual Question Answering (VQA) | InfiMM-Eval | Overall score | 32.62 | LLaVA-1.5 |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 47.1 | LLaVA-1.5-13B (Coordinates) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox) | 41.8 | LLaVA-1.5-13B (Visual Prompt) |
| Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (human) | 42.9 | LLaVA-1.5-13B (Visual Prompt) |
| Visual Question Answering (VQA) | BenchLMM | GPT-3.5 score | 55.53 | LLaVA-1.5-13B |
| Visual Question Answering (VQA) | 6-DoF SpatialBench | Orientation-abs | 25.8 | LLaVA-1.5 |
| Visual Question Answering (VQA) | 6-DoF SpatialBench | Orientation-rel | 28.3 | LLaVA-1.5 |
| Visual Question Answering (VQA) | 6-DoF SpatialBench | Position-abs | 24.5 | LLaVA-1.5 |
| Visual Question Answering (VQA) | 6-DoF SpatialBench | Position-rel | 30.9 | LLaVA-1.5 |
| Visual Question Answering (VQA) | 6-DoF SpatialBench | Total | 27.2 | LLaVA-1.5 |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 93.33 | LLaVA-v1.5 (w/ LoRA, w/ extra data) |
| Image Classification | ColonINST-v1 (Seen) | Accuracy | 92.97 | LLaVA-v1.5 (w/ LoRA, w/o extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 80.89 | LLaVA-v1.5 (w/ LoRA, w/ extra data) |
| Image Classification | ColonINST-v1 (Unseen) | Accuracy | 79.1 | LLaVA-v1.5 (w/ LoRA, w/o extra data) |
| Referring expression generation | ColonINST-v1 (Unseen) | Accuracy | 72.88 | LLaVA-v1.5 (w/ LoRA, w/ extra data) |
| Referring expression generation | ColonINST-v1 (Unseen) | Accuracy | 70.38 | LLaVA-v1.5 (w/ LoRA, w/o extra data) |
| Referring expression generation | ColonINST-v1 (Seen) | Accuracy | 99.32 | LLaVA-v1.5 (w/ LoRA, w/ extra data) |
| Referring expression generation | ColonINST-v1 (Seen) | Accuracy | 98.58 | LLaVA-v1.5 (w/ LoRA, w/o extra data) |
| Instruction Following | LLaVA-Bench | Average score | 70.7 | LLaVA-v1.5-13B |
| Instruction Following | LLaVA-Bench | Average score | 63.4 | LLaVA-v1.5-7B |
| Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-LVLM | Kendall's Tau-c | 0.002 | LLaVA-1.5-13B |
| Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-FT | Kendall's Tau-c | 0.214 | LLaVA-1.5-13B |
| Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-LLM | Kendall's Tau-c | 0.057 | LLaVA-1.5-13B |