Instruction-Guided Visual Masking

Jinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, Xianyuan Zhan

Date: 2024-05-30
Tasks: Instruction Following, Visual Grounding, Visual Question Answering (VQA), Visual Question Answering

Abstract

Instruction following is crucial in contemporary LLMs. However, when extended to the multimodal setting, it often suffers from misalignment between a specific textual instruction and the targeted local region of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMMs and robot models. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create the IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL), for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM, which, as a plug-and-play tool, significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks. Code, model, and data are available at https://github.com/2toinf/IVM.
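
The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `apply_visual_mask`, the `relevance` map, and the `threshold` and `fill_value` parameters are all hypothetical, and the relevance scores are written by hand here, whereas in IVM they would come from the grounding model conditioned on the instruction. The sketch only shows the final step, blanking out instruction-irrelevant pixels before the image is passed to a downstream multimodal model.

```python
import numpy as np


def apply_visual_mask(image, relevance, threshold=0.5, fill_value=0):
    """Blank out pixels whose relevance score is below `threshold`.

    image:     (H, W, C) array, e.g. uint8 RGB.
    relevance: (H, W) array of per-pixel scores in [0, 1]. Hypothetical
               stand-in for the output of an instruction-conditioned model.
    """
    if image.shape[:2] != relevance.shape:
        raise ValueError("relevance map must match image height and width")
    keep = relevance >= threshold  # boolean (H, W) mask of relevant pixels
    # Broadcast the mask over the channel axis and overwrite the rest.
    masked = np.where(keep[..., None], image, fill_value)
    return masked.astype(image.dtype)


# Toy input: a uniform 4x4 image whose central 2x2 patch is marked relevant.
image = np.full((4, 4, 3), 200, dtype=np.uint8)
relevance = np.zeros((4, 4), dtype=np.float32)
relevance[1:3, 1:3] = 0.9

masked = apply_visual_mask(image, relevance)
```

After this call, the central patch of `masked` keeps its original pixel values and every other pixel is set to `fill_value`. A hard threshold is only one possible choice; a soft blend weighted by the relevance score would be an equally plausible variant.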

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | V*bench | Accuracy | 81.2 | IVM-Enhanced GPT4-V
Visual Question Answering | V*bench | Accuracy | 81.2 | IVM-Enhanced GPT4-V

Related Papers

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
How Many Instructions Can LLMs Follow at Once? (2025-07-15)
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering (2025-07-15)
ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition (2025-07-15)
Multilingual Multimodal Software Developer for Code Generation (2025-07-11)