
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

Published: 2024-04-11 · CVPR 2024

Tasks: Spatial Reasoning, Question Answering, Descriptive, Zero-Shot Region Description, Hallucination, Video Question Answering, Visual Question Answering (VQA)

Links: Paper · PDF · Code

Abstract

The integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
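The central mechanism the abstract describes is serializing object locations as image-space coordinates inside instruction-tuning text. Below is a minimal sketch of that idea, assuming normalized [x1, y1, x2, y2] coordinates rendered as plain text; the prompt wording, normalization scheme, and helper names (box_to_text, make_location_qa) are illustrative assumptions, not the paper's released code.

```python
# Hedged sketch: turning bounding-box annotations into coordinate-based
# instruction-tuning pairs, in the spirit of the abstract. The text
# format [x1, y1, x2, y2] and the 0-1 normalization are assumptions;
# the paper's actual coordinate representation may differ.

def box_to_text(box, img_w, img_h, precision=2):
    """Serialize a pixel-space box as normalized coordinates in text."""
    x1, y1, x2, y2 = box
    coords = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return "[" + ", ".join(f"{c:.{precision}f}" for c in coords) + "]"

def make_location_qa(label, box, img_w, img_h):
    """Build one (instruction, response) pair asking the model to localize."""
    question = f"Where is the {label} in the image?"
    answer = f"The {label} is at {box_to_text(box, img_w, img_h)}."
    return {"instruction": question, "response": answer}

# Example: a dog at pixels (64, 120)-(320, 410) in a 640x480 image.
pair = make_location_qa("dog", (64, 120, 320, 410), 640, 480)
print(pair["instruction"])  # Where is the dog in the image?
print(pair["response"])     # The dog is at [0.10, 0.25, 0.50, 0.85].
```

Pairs like these could serve as pseudo-data generated from existing detection annotations, which is consistent with the data-efficient fine-tuning and pseudo-data generation strategies the abstract mentions.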

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 56.2 | LocVLM-L |
| Visual Question Answering (VQA) | GQA | Accuracy | 50.2 | LocVLM-L |
| Visual Question Answering (VQA) | VQA v2 val | Accuracy | 55.9 | LocVLM-L |
| Video Question Answering | ActivityNet-QA | Accuracy | 38.2 | LocVLM-Vid-B+ |
| Video Question Answering | ActivityNet-QA | Accuracy | 37.4 | LocVLM-Vid-B |
| Video Question Answering | MSVD-QA | Accuracy | 66.1 | LocVLM-Vid-B |
| Video Question Answering | TGIF-QA | Accuracy | 51.8 | LocVLM-Vid-B |
| Video Question Answering | MSR-VTT | Accuracy | 51.2 | LocVLM-Vid-B |

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
MindJourney: Test-Time Scaling with World Models for Spatial Reasoning (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)