Visual Goal-Step Inference using wikiHow

Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, Chris Callison-Burch

2021-04-12 | EMNLP 2021 11 | VGSI | Multimodal Reasoning

Abstract

Understanding which sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100M, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.
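
The four-way choice described in the abstract can be illustrated with a minimal sketch. It assumes the goal text and each candidate image have already been mapped to vectors in a shared embedding space and that candidates are ranked by cosine similarity. Both the scoring rule and the names `cosine`, `choose_step`, and `accuracy` are illustrative assumptions, not the authors' implementation, and the vectors are toy values rather than outputs of a real model.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def choose_step(goal_vec: Vector, image_vecs: Sequence[Vector]) -> int:
    """Return the index of the candidate image most similar to the goal."""
    scores = [cosine(goal_vec, img) for img in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: List[Tuple[Vector, Sequence[Vector], int]]) -> float:
    """Fraction of examples where the chosen image matches the gold index."""
    correct = sum(
        choose_step(goal, images) == gold for goal, images, gold in examples
    )
    return correct / len(examples)


# Toy example: one goal embedding and four candidate image embeddings.
# All values are made up for illustration only.
goal = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.2, 0.8]]
predicted = choose_step(goal, candidates)
```

Under this setup, the accuracy metric is simply the share of goals for which the top-scoring image is the correct step; a random guesser over four candidates would score 0.25.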

Results

Task	Dataset	Metric	Value	Model
Text-To-Image	wikiHow-image	Accuracy	0.7494	Triplet Network

Related Papers

EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent (2025-07-21)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs (2025-07-10)
MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning (2025-07-09)
Skywork-R1V3 Technical Report (2025-07-08)
Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling (2025-07-08)
Perception-Aware Policy Optimization for Multimodal Reasoning (2025-07-08)
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge (2025-07-06)