Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model to simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video understanding. Importantly, the design of LLaVA-OneVision enables strong transfer learning across modalities and scenarios, yielding new emergent capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
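For reference, the sketch below shows one way to query a LLaVA-OneVision checkpoint on a single image with Hugging Face `transformers`. It is a minimal sketch, not the authors' training or evaluation code: the `llava-hf/llava-onevision-qwen2-7b-ov-hf` checkpoint name, the sample image URL, and the prompt are assumptions, and it requires a `transformers` release that ships `LlavaOnevisionForConditionalGeneration`.

```python
# Minimal single-image inference sketch (assumptions: checkpoint name,
# sample image URL, and a transformers version with LLaVA-OneVision support).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed community checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; this COCO image is just a placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build the chat-formatted prompt the processor expects.
conversation = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "What is shown in this image?"}]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```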
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Temporal Relation Extraction | Vinoground | Group Score | 21.8 | LLaVA-OneVision-Qwen2-72B |
| Temporal Relation Extraction | Vinoground | Text Score | 48.4 | LLaVA-OneVision-Qwen2-72B |
| Temporal Relation Extraction | Vinoground | Video Score | 35.2 | LLaVA-OneVision-Qwen2-72B |
| Temporal Relation Extraction | Vinoground | Group Score | 14.6 | LLaVA-OneVision-Qwen2-7B |
| Temporal Relation Extraction | Vinoground | Text Score | 41.6 | LLaVA-OneVision-Qwen2-7B |
| Temporal Relation Extraction | Vinoground | Video Score | 29.4 | LLaVA-OneVision-Qwen2-7B |
| Visual Question Answering (VQA) | VLM2-Bench | Average (9 subtasks) | 39.35 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | GC-mat | 16.6 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | GC-trk | 13.7 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-cnt | 56.17 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-cpr | 47.22 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | OC-grp | 27.5 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-VID | 47.25 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-cnt | 46.67 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-cpr | 62 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | VLM2-Bench | PC-grp | 37 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 63.7 | LLaVA-OneVision-72B |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 57.5 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | MM-Vet | GPT-4 score | 29.1 | LLaVA-OneVision-0.5B |
| Visual Question Answering (VQA) | V*bench | Accuracy | 74.46 | LLaVA-OneVision-7B |
| Visual Question Answering (VQA) | SQA3D | Exact Match | 34.2 | LLaVA-NeXT-Video |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | BLEU-4 | 9.8 | LLaVA-NeXT-Video |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | CIDEr | 46.2 | LLaVA-NeXT-Video |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | Exact Match | 18.7 | LLaVA-NeXT-Video |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | METEOR | 9.1 | LLaVA-NeXT-Video |
| Visual Question Answering (VQA) | ScanQA Test w/ objects | ROUGE | 27.8 | LLaVA-NeXT-Video |
| Video Question Answering | OVBench | Average | 49.5 | LLaVA-OneVision-7B |
| Video Question Answering | NExT-QA | Accuracy | 80.2 | LLaVA-OneVision-72B |
| Video Question Answering | NExT-QA | Accuracy | 79.4 | LLaVA-OneVision-7B |
| Video Question Answering | VNBench | Accuracy | 58.7 | LLaVA-OneVision-72B |
| Video Question Answering | VNBench | Accuracy | 51.8 | LLaVA-OneVision-7B |