MMInstruct-GPT4V
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Images, Texts | Apache License, Version 2.0 | Introduced 2024-07-22
Vision-language supervised fine-tuning effectively enhances the performance of vision large language models (VLLMs), but existing visual instruction tuning datasets have two main limitations:
- Instruction Annotation Quality: Despite strong performance, advanced VLLMs may generate instructions with inaccuracies, such as hallucinations.
- Instruction and Image Diversity: A limited range of instruction types and a lack of diverse image data limit the model's ability to generate varied and realistic outputs.
MMInstruct Dataset
To address these challenges, we created the MMInstruct dataset, featuring:
- 973K instructions from 24 domains
- Four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering.
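As an illustration of how the four instruction types listed above might be represented when working with the data, here is a minimal Python sketch. The record fields (`image_path`, `domain`, `instruction_type`, `question`, `answer`), the enum values, and the toy samples are all assumptions made for this example; they are not the dataset's actual schema or contents.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class InstructionType(Enum):
    # The four instruction types named in the dataset description.
    # The string values are hypothetical labels, not official identifiers.
    JUDGEMENT = "judgement"
    MULTIPLE_CHOICE = "multiple_choice"
    LONG_VQA = "long_vqa"
    SHORT_VQA = "short_vqa"


@dataclass
class InstructionRecord:
    # Hypothetical record layout; the real files may use different fields.
    image_path: str
    domain: str
    instruction_type: InstructionType
    question: str
    answer: str


def count_by_type(records):
    """Return how many records fall under each instruction type."""
    counts = Counter(r.instruction_type for r in records)
    # Include types with zero records so all four always appear.
    return {t: counts.get(t, 0) for t in InstructionType}


# Toy records invented for illustration only.
samples = [
    InstructionRecord("img_0.jpg", "example_domain_a",
                      InstructionType.JUDGEMENT,
                      "Is there a dog in the image?", "Yes"),
    InstructionRecord("img_1.jpg", "example_domain_a",
                      InstructionType.MULTIPLE_CHOICE,
                      "What color is the car? A. Red B. Blue", "A"),
    InstructionRecord("img_2.jpg", "example_domain_b",
                      InstructionType.SHORT_VQA,
                      "How many people are visible?", "Three"),
]

if __name__ == "__main__":
    for itype, n in count_by_type(samples).items():
        print(f"{itype.value}: {n}")
```

A per-type tally like this is one simple way to check how instruction types are balanced in a subset before fine-tuning.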