Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

2023-04-17 | NeurIPS 2023 11

Tasks: Spatial Reasoning; visual instruction following; Instruction Following; MMR total; Image Classification; Referring expression generation; Referring Expression Comprehension; Video Question Answering; 3D Question Answering (3D-QA); Visual Reasoning; 1 Image, 2*2 Stitching; Visual Question Answering; Image Retrieval


Abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
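The abstract reports a "relative score compared with GPT-4" without spelling out the arithmetic. One plausible reading, sketched below, is that a judge rates the candidate model's answer and a reference answer for each question, and the candidate's total is expressed as a percentage of the reference's total. The function name `relative_score` and the toy ratings are hypothetical and chosen only for illustration; they are not taken from the paper or its code base.

```python
def relative_score(candidate_scores, reference_scores):
    """Return the candidate's total judge score as a percentage of the reference's total.

    Assumption: both lists hold per-question ratings from the same judge,
    aligned by question. This is an illustrative sketch, not the paper's code.
    """
    if len(candidate_scores) != len(reference_scores):
        raise ValueError("score lists must be aligned by question")
    reference_total = sum(reference_scores)
    if reference_total == 0:
        raise ValueError("reference total must be non-zero")
    return 100.0 * sum(candidate_scores) / reference_total


if __name__ == "__main__":
    # Made-up ratings for three questions (not results from the paper).
    candidate = [6, 7, 8]
    reference = [8, 9, 10]
    print(f"{relative_score(candidate, reference):.1f}%")
```

Under this reading, a relative score of 100% would mean the judge rated the candidate's answers as highly, in total, as the reference answers.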

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	BenchLMM	GPT-3.5 score	46.83	LLaVA-1.5-7B
Visual Question Answering (VQA)	BenchLMM	GPT-3.5 score	43.5	LLaVA-1-13B
Visual Question Answering (VQA)	EmbSpatial-Bench	Generation	35.19	LLaVA-1.6
Visual Question Answering (VQA)	ScanQA Test w/ objects	BLEU-4	13.5	LL3DA
Visual Question Answering (VQA)	ScanQA Test w/ objects	CIDEr	76.8	LL3DA
Visual Question Answering (VQA)	ScanQA Test w/ objects	METEOR	15.9	LL3DA
Visual Question Answering (VQA)	ScanQA Test w/ objects	ROUGE	37.3	LL3DA
Video Question Answering	MVBench	Avg.	36	LLaVA
Image Classification	ColonINST-v1 (Seen)	Accuracy	89.61	LLaVA-v1 (w/ LoRA, w/ extra data)
Image Classification	ColonINST-v1 (Seen)	Accuracy	87.86	LLaVA-v1 (w/ LoRA, w/o extra data)
Image Classification	ColonINST-v1 (Unseen)	Accuracy	72.08	LLaVA-v1 (w/ LoRA, w/o extra data)
Image Classification	ColonINST-v1 (Unseen)	Accuracy	42.17	LLaVA-v1 (w/ LoRA, w/ extra data)
Referring expression generation	ColonINST-v1 (Unseen)	Accuracy	68.11	LLaVA-v1 (w/ LoRA, w/o extra data)
Referring expression generation	ColonINST-v1 (Unseen)	Accuracy	46.85	LLaVA-v1 (w/ LoRA, w/ extra data)
Referring expression generation	ColonINST-v1 (Seen)	Accuracy	86.87	LLaVA-v1 (w/ LoRA, w/ extra data)
Referring expression generation	ColonINST-v1 (Seen)	Accuracy	84.55	LLaVA-v1 (w/ LoRA, w/o extra data)
MMR total	MRR-Benchmark	Total Column Score	412	LLaVA-NEXT-34B
MMR total	MRR-Benchmark	Total Column Score	335	LLaVA-NEXT-13B
MMR total	MRR-Benchmark	Total Column Score	243	LLaVA-1.5-13B
