Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Neighbor-view Enhanced Model for Vision and Language Navigation

Dong An, Yuankai Qi, Yan Huang, Qi Wu, Liang Wang, Tieniu Tan

2021-07-15 · Navigate · Vision and Language Navigation

Paper · PDF · Code (official)

Abstract

Vision and Language Navigation (VLN) requires an agent to navigate to a target location by following natural language instructions. Most existing works represent a navigation candidate by the feature of the single view in which the candidate lies. However, an instruction may mention landmarks outside of that single view as references, which can lead to failures of textual-visual matching in existing methods. In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) to adaptively incorporate visual contexts from neighbor views for better textual-visual matching. Specifically, our NvEM utilizes a subject module and a reference module to collect contexts from neighbor views. The subject module fuses neighbor views at a global level, and the reference module fuses neighbor objects at a local level. Subjects and references are adaptively determined via attention mechanisms. Our model also includes an action module to utilize the strong orientation guidance (e.g., "turn left") in instructions. Each module predicts a navigation action separately, and their weighted sum is used to predict the final action. Extensive experimental results demonstrate the effectiveness of the proposed method on the R2R and R4R benchmarks against several state-of-the-art navigators, and NvEM even beats some pre-training ones. Our code is available at https://github.com/MarSaKi/NvEM.
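
The fusion step described in the abstract (each module scores the navigation candidates separately, and a weighted sum of the per-module scores picks the final action) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the class name, the learned weight head, and the tensor shapes are all hypothetical; see the official repository for the real model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedActionFusion(nn.Module):
    """Sketch of NvEM-style action fusion: three modules (subject,
    reference, action) each produce candidate scores, and a learned,
    state-conditioned weight combines them. Module internals are
    placeholders, not the paper's architecture."""

    def __init__(self, state_dim: int):
        super().__init__()
        # One scalar weight per module, predicted from the agent state
        # and normalized with softmax (an assumed design, for illustration).
        self.weight_head = nn.Linear(state_dim, 3)

    def forward(self, state, subject_logits, reference_logits, action_logits):
        # state: (batch, state_dim); each *_logits: (batch, num_candidates)
        w = F.softmax(self.weight_head(state), dim=-1)        # (batch, 3)
        stacked = torch.stack(
            [subject_logits, reference_logits, action_logits], dim=1
        )                                                      # (batch, 3, num_candidates)
        fused = (w.unsqueeze(-1) * stacked).sum(dim=1)         # (batch, num_candidates)
        return fused  # final candidate scores; argmax selects the action

# Usage with dummy tensors:
fusion = WeightedActionFusion(state_dim=512)
state = torch.randn(2, 512)
module_logits = [torch.randn(2, 8) for _ in range(3)]
scores = fusion(state, *module_logits)
action = scores.argmax(dim=-1)  # index of the chosen navigation candidate
```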

Results

Task                           | Dataset       | Metric         | Value | Model
-------------------------------|---------------|----------------|-------|-------
Vision and Language Navigation | VLN Challenge | error          | 4.37  | MM2021
Vision and Language Navigation | VLN Challenge | length         | 12.98 | MM2021
Vision and Language Navigation | VLN Challenge | oracle success | 0.66  | MM2021
Vision and Language Navigation | VLN Challenge | spl            | 0.54  | MM2021
Vision and Language Navigation | VLN Challenge | success        | 0.58  | MM2021
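
Of the metrics above, SPL (Success weighted by Path Length) is the standard headline number for VLN: per episode it multiplies binary success by the ratio of the shortest-path length to the length the agent actually traveled, then averages over episodes. A small illustrative helper (not tied to the paper's code; names and example numbers are made up):

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length: for each episode i, compute
    S_i * l_i / max(p_i, l_i), where S_i is 1 on success and 0 otherwise,
    l_i is the shortest-path length to the goal, and p_i is the length of
    the agent's actual path; return the mean over all episodes."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Example: a near-optimal success, a success via a long detour, and a failure.
print(spl([1, 1, 0], [10.0, 10.0, 10.0], [10.5, 20.0, 12.0]))  # ~0.484
```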

Related Papers

Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking (2025-07-15)
Privacy-Preserving Multi-Stage Fall Detection Framework with Semi-supervised Federated Learning and Robotic Vision Confirmation (2025-07-14)
Automating MD simulations for Proteins using Large language Models: NAMD-Agent (2025-07-10)
Graph Learning (2025-07-08)
Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions (2025-07-06)
STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking (2025-07-04)