
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra

2020-04-30 · ECCV 2020 · Vision and Language Navigation
Paper · PDF · Code (official)

Abstract

Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e.g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs'). We ask the following question -- can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions) to learn visual groundings (what do 'stairs' look like?) that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a sequence of panoramic RGB images captured by the agent. We demonstrate that pretraining VLN-BERT on image-text pairs from the web before fine-tuning on embodied path-instruction data significantly improves performance on VLN -- outperforming the prior state-of-the-art in the fully-observed setting by 4 absolute percentage points on success rate. Ablations of our pretraining curriculum show each stage to be impactful -- with their combination resulting in further positive synergistic effects.
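The core component described above is a compatibility scorer: a transformer that encodes instruction tokens and panoramic image features jointly and outputs a scalar score for the (instruction, path) pair. Below is a minimal sketch of that idea, assuming a generic PyTorch encoder; the class name, dimensions, and pooling choice are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch of the instruction/path compatibility-scoring idea.
# All module names and sizes here are illustrative, not VLN-BERT itself.
import torch
import torch.nn as nn

class CompatibilityScorer(nn.Module):
    def __init__(self, text_vocab=30522, d_model=256, n_heads=4, n_layers=2,
                 img_feat_dim=2048, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(text_vocab, d_model)
        self.img_proj = nn.Linear(img_feat_dim, d_model)  # project image features
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.score_head = nn.Linear(d_model, 1)  # scalar compatibility score

    def forward(self, token_ids, img_feats):
        # token_ids: (B, T_text); img_feats: (B, T_img, img_feat_dim)
        text = self.tok_emb(token_ids)
        imgs = self.img_proj(img_feats)
        seq = torch.cat([text, imgs], dim=1)  # joint text+image sequence
        pos = torch.arange(seq.size(1), device=seq.device)
        enc = self.encoder(seq + self.pos_emb(pos))
        return self.score_head(enc[:, 0]).squeeze(-1)  # pool first token

model = CompatibilityScorer()
tokens = torch.randint(0, 30522, (2, 16))  # toy instruction token ids
panorama = torch.randn(2, 8, 2048)         # toy features for 8 path views
print(model(tokens, panorama).shape)       # -> torch.Size([2])
```

At inference, a scorer like this would be applied to each candidate path from a beam and the highest-scoring path selected.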

Results

Task                           | Dataset       | Metric         | Value  | Model
Vision and Language Navigation | VLN Challenge | error          | 3.09   | VLN-Bert
Vision and Language Navigation | VLN Challenge | length         | 686.62 | VLN-Bert
Vision and Language Navigation | VLN Challenge | oracle success | 0.99   | VLN-Bert
Vision and Language Navigation | VLN Challenge | spl            | 0.01   | VLN-Bert
Vision and Language Navigation | VLN Challenge | success        | 0.73   | VLN-Bert
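A note on reading the table: "success" is the fraction of episodes ending near the goal, while "spl" is Success weighted by Path Length (Anderson et al., 2018), which discounts each success by the ratio of shortest-path length to actual trajectory length. The very long trajectory length (686.62) is consistent with beam-search-style evaluation and explains the near-zero SPL despite the 0.73 success rate. A minimal reference implementation of SPL under that standard definition (not code from this paper):

```python
# SPL (Success weighted by Path Length), per Anderson et al. (2018):
# mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is a 0/1
# success indicator, l_i the shortest-path length, p_i the path taken.
def spl(successes, shortest_lengths, taken_lengths):
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, taken_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Toy example: a direct success, a success with a 2x detour, a failure.
print(spl([1, 1, 0], [10.0, 8.0, 12.0], [10.0, 16.0, 20.0]))  # 0.5
```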

Related Papers

Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments (2025-06-30)
Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding (2025-06-12)
A Navigation Framework Utilizing Vision-Language Models (2025-06-11)
Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion (2025-05-29)
Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation (2025-05-27)
FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models (2025-05-19)
Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation (2025-05-16)