High-Order Attention Models for Visual Question Answering

Idan Schwartz, Alexander G. Schwing, Tamir Hazan

2017-11-12NeurIPS 2017 12Question Answering Vocal Bursts Intensity Prediction Visual Question Answering (VQA)Visual Question Answering

Paper PDF Code(official)

Abstract

The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they take into account different data modalities, such as visual and textual input. In this paper we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various data modalities. We show that high-order correlations effectively direct the appropriate attention to the relevant elements in the different data modalities that are required to solve the joint task. We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset.

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	COCO Visual Question Answering (VQA) real images 1.0 multiple choice	Percentage correct	69.3	3-Modalities: Unary + Pairwise + Ternary (ResNet)

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL2025-07-17 Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering2025-07-17 Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It2025-07-17 City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning2025-07-17 VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning2025-07-17 Describe Anything Model for Visual Question Answering on Text-rich Images2025-07-16 Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility2025-07-16 MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM2025-07-16