LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection

Zhuo Chen, Yufeng Huang, Jiaoyan Chen, Yuxia Geng, Yin Fang, Jeff Pan, Ningyu Zhang, Wen Zhang

2022-07-26

Question Answering, Knowledge Graphs, Text Generation, Visual Question Answering (VQA)

Abstract

Visual question answering (VQA) often requires an understanding of visual concepts and language semantics, which relies on external knowledge. Most existing methods exploit pre-trained language models and/or unstructured text, but the knowledge in these resources is often incomplete and noisy. Some other methods prefer to use knowledge graphs (KGs), which often contain rich structured knowledge, but this line of research is still quite preliminary. In this paper, we propose LaKo, a knowledge-driven VQA method via Late Knowledge-to-Text Injection. To effectively incorporate an external KG, we convert triples into textual format and propose a late injection mechanism for knowledge fusion. Finally, we address VQA as a text generation task with an effective encoder-decoder paradigm, which achieves state-of-the-art results on the OK-VQA dataset.
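The knowledge-to-text step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the relation normalization, the "question / context / knowledge" input layout, and the example triple are all assumptions made for illustration. The abstract does not detail how the late injection mechanism fuses the text, so that step is not shown.

```python
from typing import List, Tuple

# A KG triple as (head, relation, tail); this layout is an assumption.
Triple = Tuple[str, str, str]


def triple_to_text(triple: Triple) -> str:
    """Verbalize one triple as a short plain-text sentence."""
    head, relation, tail = triple
    # Hypothetical normalization: "used_for" becomes "used for".
    relation_words = relation.replace("_", " ").strip().lower()
    return f"{head} {relation_words} {tail}."


def build_generator_input(question: str, caption: str, triples: List[Triple]) -> str:
    """Join question, image caption, and verbalized triples into one string.

    The field labels are illustrative only; an encoder-decoder text
    generator would consume a string of this kind and decode the answer.
    """
    knowledge = " ".join(triple_to_text(t) for t in triples)
    return f"question: {question} context: {caption} knowledge: {knowledge}"


if __name__ == "__main__":
    example = build_generator_input(
        "What is this object used for?",
        "a person holding an umbrella",
        [("umbrella", "used_for", "rain")],
    )
    print(example)
```

Treating triples as text lets a single text-generation model read the question, the visual description, and the retrieved knowledge in one format, which is the motivation the abstract gives for the textual conversion.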

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	OK-VQA	Accuracy	47.01	LaKo
Visual Question Answering (VQA)	OK-VQA	Accuracy	42.03	T5 (Tan and Bansal, 2019) + Prefixes
Visual Question Answering (VQA)	VQA v2 test-dev	Accuracy	68.07	LaKo
