GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, Ping Luo

2023-07-07

Attribute, Common Sense Reasoning, Large Language Model, Visual Question Answering (VQA), Visual Commonsense Reasoning, Language Modelling, Visual Question Answering


Abstract

Visual instruction tuning of large language models (LLMs) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits their advancement toward fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces references to the region-of-interest (RoI) in the instruction. Before being sent to the LLM, each reference is replaced by RoI features and interleaved with language embeddings as a sequence. Our model GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: Users can interact with our model through both language and drawn bounding boxes to flexibly adjust the referring granularity. (2) Versatile multimodal abilities: GPT4RoI can mine a variety of attribute information within each RoI, e.g., color, shape, material, action, etc. Furthermore, it can reason about multiple RoIs based on common sense. On the Visual Commonsense Reasoning (VCR) dataset, GPT4RoI achieves a remarkable accuracy of 81.6%, surpassing all existing models by a significant margin (the second place is 75.6%) and almost reaching the human-level performance of 85.0%. The code, dataset, and demo can be found at https://github.com/jshilong/GPT4RoI.
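The replace-and-interleave step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the placeholder token `<region1>`, the function name `interleave_roi_features`, and the toy 2-dimensional vectors are all assumptions made for illustration. It only shows the idea that a region reference in the instruction is swapped for that region's feature vector while ordinary words keep their language embeddings.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def interleave_roi_features(
    tokens: Sequence[str],
    token_embeddings: Dict[str, Vector],
    roi_features: Dict[str, Vector],
) -> List[Vector]:
    """Build one input sequence in which RoI features stand in for region references.

    All names here are hypothetical; they are not taken from the GPT4RoI codebase.
    """
    sequence: List[Vector] = []
    for token in tokens:
        if token in roi_features:
            # Region reference: substitute the feature extracted for that box.
            sequence.append(list(roi_features[token]))
        else:
            # Ordinary word: use its language embedding.
            sequence.append(list(token_embeddings[token]))
    return sequence


# Toy example with made-up 2-dimensional vectors.
tokens = ["what", "is", "<region1>", "doing", "?"]
token_embeddings = {
    "what": [0.1, 0.2],
    "is": [0.3, 0.4],
    "doing": [0.5, 0.6],
    "?": [0.7, 0.8],
}
roi_features = {"<region1>": [9.0, 9.5]}  # assumed output of some RoI feature extractor

sequence = interleave_roi_features(tokens, token_embeddings, roi_features)
```

In this sketch the sequence length equals the number of instruction tokens, and the RoI feature must have the same dimensionality as the language embeddings so that both can be fed to the model as one sequence.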

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	VCR (Q-AR) test	Accuracy	81.6	GPT4RoI
Visual Question Answering (VQA)	VCR (QA-R) test	Accuracy	91	GPT4RoI
Visual Question Answering (VQA)	VCR (Q-A) test	Accuracy	89.4	GPT4RoI
Visual Question Answering (VQA)	ViP-Bench	GPT-4 score (bbox)	35.1	GPT4ROI 7B (ROI)
