A simple neural network module for relational reasoning

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, Timothy Lillicrap

2017-06-05 | NeurIPS 2017 12
Tasks: Question Answering, Image Retrieval with Multi-Modal Query, Relational Reasoning, Visual Question Answering (VQA)


Abstract

Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn. In this paper we describe how to use Relation Networks (RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning. We tested RN-augmented networks on three tasks: visual question answering using a challenging dataset called CLEVR, on which we achieve state-of-the-art, super-human performance; text-based question answering using the bAbI suite of tasks; and complex reasoning about dynamic physical systems. Then, using a curated dataset called Sort-of-CLEVR we show that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with RNs. Our work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.
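To make the "plug-and-play module" idea concrete, here is a minimal NumPy sketch of the composite form the paper uses for an RN: a shared function g is applied to every ordered pair of objects, the results are summed, and a second function f maps the sum to the output, i.e. RN(O) = f(sum over i,j of g(o_i, o_j)). The layer sizes, random initialization, and helper names (`init_mlp`, `mlp`, `relation_network`) are illustrative assumptions, not the authors' configuration, and the question conditioning used for CLEVR is omitted.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Randomly initialized MLP weights (illustrative sizes, not the paper's)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """ReLU MLP with a linear final layer."""
    for k, (w, b) in enumerate(params):
        x = x @ w + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def relation_network(objects, g_params, f_params):
    """Compute f(sum_{i,j} g(o_i, o_j)) for an (n, d) array of objects."""
    n, _ = objects.shape
    left = np.repeat(objects, n, axis=0)            # o_i for every ordered pair
    right = np.tile(objects, (n, 1))                # o_j for every ordered pair
    pairs = np.concatenate([left, right], axis=1)   # shape (n*n, 2d)
    relations = mlp(g_params, pairs)                # g applied to each pair
    return mlp(f_params, relations.sum(axis=0))     # f applied to the summed relations

# Hypothetical sizes: 5 objects with 4 features each, 3 output logits.
rng = np.random.default_rng(0)
g_params = init_mlp(rng, [2 * 4, 16, 16])
f_params = init_mlp(rng, [16, 16, 3])
objects = rng.normal(size=(5, 4))
out = relation_network(objects, g_params, f_params)
```

Because the pairwise outputs are summed before f is applied, the result does not depend on the order in which the objects are presented. In an RN-augmented network, the objects would typically come from another component, such as the cells of a CNN feature map or LSTM states.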

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	CLEVR	Accuracy	95.5	CNN + LSTM + RN
Image Retrieval with Multi-Modal Query	Fashion200k	Recall@1	13	Relationship
Image Retrieval with Multi-Modal Query	Fashion200k	Recall@10	40.5	Relationship
Image Retrieval with Multi-Modal Query	Fashion200k	Recall@50	62.4	Relationship
