Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

Haozhi Qi, Xiaolong Wang, Deepak Pathak, Yi Ma, Jitendra Malik

2020-08-05 · ICLR 2021 · Region Proposal, Common Sense Reasoning, Visual Reasoning

Abstract

Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches to learning dynamics from visual input sidestep long-term prediction by resorting to rapid re-planning with short-term models. This not only requires such models to be highly accurate but also limits them to tasks where an agent can continuously obtain feedback and take an action at each step until completion. In this paper, we aim to leverage ideas from success stories in visual recognition to build object representations that can capture inter-object and object-environment interactions over a long range. To this end, we propose Region Proposal Interaction Networks (RPIN), which reason about each object's trajectory in a latent region-proposal feature space. Thanks to this simple yet effective object representation, our approach outperforms prior methods by a significant margin, both in prediction quality and in its ability to plan for downstream tasks, and also generalizes well to novel environments. Code, pre-trained models, and more visualization results are available at https://haozhi.io/RPIN.
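
To make the idea of rolling out per-object latent states concrete, here is a minimal sketch of an interaction-style update. It is not the authors' implementation: the abstract does not spell out the architecture, so `pairwise_effect`, `interaction_step`, and `rollout` are hypothetical names, and the hand-written effect function stands in for what would be a learned network operating on region-proposal features.

```python
def pairwise_effect(z_i, z_j):
    # Hypothetical stand-in for a learned relation function:
    # object i is nudged toward object j in latent space.
    return [0.1 * (b - a) for a, b in zip(z_i, z_j)]


def interaction_step(states):
    """Update every object's latent vector using the summed effects of all other objects."""
    new_states = []
    for i, z_i in enumerate(states):
        aggregated = [0.0] * len(z_i)
        for j, z_j in enumerate(states):
            if i == j:
                continue  # an object exerts no effect on itself in this sketch
            effect = pairwise_effect(z_i, z_j)
            aggregated = [a + e for a, e in zip(aggregated, effect)]
        new_states.append([z + a for z, a in zip(z_i, aggregated)])
    return new_states


def rollout(states, steps):
    """Predict a trajectory by feeding each prediction back in as the next input."""
    trajectory = [states]
    for _ in range(steps):
        states = interaction_step(states)
        trajectory.append(states)
    return trajectory
```

Calling `rollout([[0.0], [1.0]], 3)` returns four sets of object states (the initial one plus three predictions). Because each step consumes the previous prediction rather than fresh observations, errors can compound over the horizon, which is why long-range accuracy matters when feedback is not available at every step.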

Results

Task	Dataset	Metric	Value	Model
Visual Reasoning	PHYRE-1B-Within	AUCCESS	85.2	RPIN
Visual Reasoning	PHYRE-1B-Cross	AUCCESS	42.2	RPIN
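
For readers unfamiliar with the metric, the sketch below shows how AUCCESS is computed as we understand it from the PHYRE benchmark definition: a weighted average of the success rate within k attempts, with weights log(k+1) - log(k) that emphasize solving tasks in few attempts. The function name `auccess` and the input layout are assumptions made for illustration, and the result is scaled to 0-100 to match the table above.

```python
import math


def auccess(success_at_k):
    """Weighted average of success rates.

    success_at_k[k - 1] is the fraction of tasks solved within k attempts
    (PHYRE uses k = 1..100; here the length of the list sets the range).
    """
    weights = [math.log(k + 1) - math.log(k) for k in range(1, len(success_at_k) + 1)]
    weighted_sum = sum(w * s for w, s in zip(weights, success_at_k))
    return 100.0 * weighted_sum / sum(weights)
```

An agent that solves every task on the first attempt scores 100, while one that only starts succeeding after many attempts scores much lower, since the early attempts carry most of the weight.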
