
Airbert: In-domain Pretraining for Vision-and-Language Navigation

Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, Cordelia Schmid

2021-08-20 · ICCV 2021 · Tasks: Referring Expression, Navigate, Vision and Language Navigation
Paper · PDF · Code (official) · Code

Abstract

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and yields limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings on online rental marketplaces. Using these IC pairs, we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order within PI pairs. We use BnB to pretrain our Airbert model, which can be adapted to discriminative and generative settings, and show that it outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly improves performance on a challenging few-shot VLN evaluation, where we train the model on VLN instructions from only a few houses.
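
The abstract describes the shuffling loss only at a high level. Purely as illustration, here is a minimal PyTorch sketch of one plausible reading: an aligned path-instruction pair should score higher than the same pair with its visual path temporally shuffled. The `scorer` stands in for a hypothetical cross-modal encoder returning an alignment score; the class name, margin value, and tensor shapes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class ShufflingLossSketch(nn.Module):
    """Hypothetical sketch: rank an ordered path-instruction pair
    above a temporally shuffled negative of the same pair."""

    def __init__(self, scorer: nn.Module, margin: float = 0.2):
        super().__init__()
        self.scorer = scorer  # scorer(path_feats, instr_tokens) -> scalar score
        self.margin = margin  # assumed hyperparameter, not from the paper

    def forward(self, path_feats: torch.Tensor, instr_tokens: torch.Tensor) -> torch.Tensor:
        # path_feats: (T, D) features of the T panoramas along the path
        # instr_tokens: (L,) token ids of the matching instruction
        pos = self.scorer(path_feats, instr_tokens)

        # Build a negative by permuting the temporal order of the path,
        # re-sampling so we never draw the identity permutation.
        t = path_feats.size(0)
        assert t > 1, "need at least two steps to shuffle"
        perm = torch.randperm(t)
        while torch.equal(perm, torch.arange(t)):
            perm = torch.randperm(t)
        neg = self.scorer(path_feats[perm], instr_tokens)

        # Margin ranking: push the ordered pair above the shuffled one.
        return torch.relu(self.margin - pos + neg)
```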

Results

Task                             Dataset         Metric           Value    Model
Vision and Language Navigation   VLN Challenge   error            2.58     Airbert
Vision and Language Navigation   VLN Challenge   length           686.54   Airbert
Vision and Language Navigation   VLN Challenge   oracle success   0.99     Airbert
Vision and Language Navigation   VLN Challenge   spl              0.01     Airbert
Vision and Language Navigation   VLN Challenge   success          0.78     Airbert
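
For context on the metrics: "success" is the fraction of episodes in which the agent stops near the goal, "oracle success" credits the closest point along the trajectory, and "spl" is the standard Success weighted by Path Length (Anderson et al., 2018). A minimal reference implementation of SPL, with hypothetical argument names:

```python
def spl(successes, shortest_lengths, agent_lengths):
    """Success weighted by Path Length (Anderson et al., 2018):
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is the
    per-episode success flag, l_i the shortest-path length to the goal,
    and p_i the length of the path the agent actually took."""
    return sum(
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, agent_lengths)
    ) / len(successes)
```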

Related Papers

- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
- Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
- CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking (2025-07-15)
- Privacy-Preserving Multi-Stage Fall Detection Framework with Semi-supervised Federated Learning and Robotic Vision Confirmation (2025-07-14)
- Automating MD simulations for Proteins using Large language Models: NAMD-Agent (2025-07-10)
- Graph Learning (2025-07-08)
- Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions (2025-07-06)
- STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking (2025-07-04)