BEVBert: Multimodal Map Pre-training for Language-guided Navigation

Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, Jing Shao

Published: 2022-12-08 | Venue: ICCV 2023 | Tasks: Visual Navigation, Vision and Language Navigation

Abstract

Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods employ discrete panoramas to learn visual-textual associations. This requires the model to implicitly correlate incomplete, duplicate observations within the panoramas, which may impair an agent's spatial understanding. Thus, we propose a new map-based pre-training paradigm that is spatial-aware for use in VLN. Concretely, we build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. This hybrid design can balance the demand of VLN for both short-term reasoning and long-term planning. Then, based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning, thereby facilitating the language-guided navigation goal. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art results on four VLN benchmarks.
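
The hybrid map idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names `LocalMetricMap` and `TopoMap`, the 2D grid, the cell size, and the feature averaging are all assumptions chosen for illustration. The local map merges observations that fall into the same grid cell (so duplicates collapse into one entry), while the global map is a plain graph over visited viewpoints that supports long-range route queries.

```python
import math
from collections import defaultdict, deque


class LocalMetricMap:
    """Toy grid map (hypothetical): averages features that land in the same cell."""

    def __init__(self, cell_size=0.5):
        self.cell_size = cell_size
        self._sums = {}    # cell -> running feature sum
        self._counts = {}  # cell -> number of observations merged

    def add(self, x, y, feature):
        cell = (math.floor(x / self.cell_size), math.floor(y / self.cell_size))
        if cell not in self._sums:
            self._sums[cell] = [0.0] * len(feature)
            self._counts[cell] = 0
        self._sums[cell] = [s + f for s, f in zip(self._sums[cell], feature)]
        self._counts[cell] += 1

    def cell_features(self):
        # One averaged feature per occupied cell: duplicates are merged.
        return {c: [s / self._counts[c] for s in self._sums[c]] for c in self._sums}


class TopoMap:
    """Toy undirected graph over viewpoints (hypothetical) for long-term planning."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def shortest_path(self, start, goal):
        # Breadth-first search; returns a node list, or None if unreachable.
        parents = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return path[::-1]
            for nxt in sorted(self.edges[node]):
                if nxt not in parents:
                    parents[nxt] = node
                    queue.append(nxt)
        return None


local_map = LocalMetricMap(cell_size=0.5)
local_map.add(0.1, 0.1, [1.0, 0.0])  # two overlapping views of the same spot
local_map.add(0.3, 0.2, [0.0, 1.0])
local_map.add(1.2, 0.1, [1.0, 1.0])  # a different spot

topo = TopoMap()
topo.add_edge("A", "B")
topo.add_edge("B", "C")
topo.add_edge("A", "D")
route = topo.shortest_path("A", "C")
```

In this sketch the two overlapping observations end up in a single cell, and the graph query returns the route through the intermediate viewpoint. The paper's actual map construction and the learned multimodal representation are not described on this page.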

Results

Task               Dataset  Metric  Value  Model
Visual Navigation  R2R      SPL     0.6    BEV-BERT
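
For readers unfamiliar with the metric, SPL (Success weighted by Path Length) is commonly defined as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the binary success indicator, l_i the shortest-path length from start to goal, and p_i the length of the path the agent actually took. The sketch below assumes that standard definition; the function name `spl` and the example episode values are made up for illustration and are not taken from the paper or the leaderboard.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        path_lengths: Sequence[float]) -> float:
    """Mean of S_i * l_i / max(p_i, l_i) over all episodes."""
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path length must be positive")
        if success:
            # A failed episode contributes 0; a successful one is
            # discounted by how much longer the taken path was.
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Three illustrative episodes: optimal success, success with a detour, failure.
example_score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0])
```

With these made-up episodes the contributions are 1.0, 0.5, and 0.0, so the score is 0.5. A score of 1.0 would require every episode to succeed along a shortest path.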
