Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

2024-12-04 · Visual Instruction Following · Multimodal Large Language Model · Video Understanding · Visual Question Answering

Paper · PDF · Code (official)

Abstract

Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Notably, existing works find that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline assisted by GPT-4o to extract instance-level information from images and videos through explicit visual prompting for instance guidance. Building upon this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.
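The abstract describes a pipeline that converts instance-level annotations into instruction-tuning data, where questions and answers refer to instances by the IDs overlaid on the frame as explicit visual prompts. The sketch below shows what one such training record might look like; the schema, field names, and the bracketed `[1]`-style instance references are illustrative assumptions, not the actual Inst-IT data format.

```python
import json

def build_instance_sample(frame_id, instances, question, answer):
    """Assemble one hypothetical instruction-tuning record in which the
    conversation refers to instances by the IDs drawn onto the frame.

    `instances` is a list of dicts with an integer `id` and a bounding
    `box` in [x0, y0, x1, y1] pixel coordinates (assumed format).
    """
    return {
        "frame": frame_id,
        # Map each overlaid instance ID to its bounding box.
        "instances": {str(inst["id"]): inst["box"] for inst in instances},
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }

sample = build_instance_sample(
    "video0001_frame_12",
    [{"id": 1, "box": [10, 10, 60, 50]}, {"id": 2, "box": [70, 30, 120, 90]}],
    "What is instance [1] doing relative to instance [2]?",
    "Instance [1] is walking toward instance [2].",
)
print(json.dumps(sample, indent=2))
```

Keying the record on stable instance IDs is what lets a single annotation pass support both per-frame questions and the temporal, cross-frame questions the paper targets.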

Results

Task                            | Dataset   | Metric             | Value | Model
Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox)  | 50.5 | LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt)
Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (human) | 49   | LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt)
Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (bbox)  | 45.1 | LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt)
Visual Question Answering (VQA) | ViP-Bench | GPT-4 score (human) | 48.2 | LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt)

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
LRMR: LLM-Driven Relational Multi-node Ranking for Lymph Node Metastasis Assessment in Rectal Cancer (2025-07-15)
MFGDiffusion: Mask-Guided Smoke Synthesis for Enhanced Forest Fire Detection (2025-07-15)
KptLLM++: Towards Generic Keypoint Comprehension with Large Language Model (2025-07-15)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)