Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation

Wanrong Zhu, Xin Eric Wang, Tsu-Jui Fu, An Yan, Pradyumna Narayana, Kazoo Sone, Sugato Basu, William Yang Wang

2020-07-01 · EACL 2021 · Style Transfer · Text Style Transfer · Vision and Language Navigation
Paper · PDF · Code

Abstract

One of the most challenging topics in Natural Language Processing (NLP) is visually grounded language understanding and reasoning. Outdoor vision-and-language navigation (VLN) is one such task, in which an agent follows natural language instructions to navigate a real-life urban environment. Due to the lack of human-annotated instructions that describe intricate urban scenes, outdoor VLN remains a challenging task to solve. This paper introduces a Multimodal Text Style Transfer (MTST) learning approach that leverages external multimodal resources to mitigate data scarcity in outdoor navigation tasks. We first enrich the navigation data by transferring the style of instructions generated with the Google Maps API, and then pre-train the navigator on the augmented external outdoor navigation dataset. Experimental results show that the MTST learning approach is model-agnostic and significantly outperforms the baseline models on the outdoor VLN task, improving the task completion rate on the test set by a relative 8.7%.

Results

Task                             Dataset            Metric                 Value   Model
Vision and Language Navigation   Touchdown Dataset  Task Completion (TC)   16.2    VLN Transformer +M-50 +style
Vision and Language Navigation   Touchdown Dataset  Task Completion (TC)   14.9    VLN Transformer
Vision and Language Navigation   Touchdown Dataset  Task Completion (TC)   11.9    Gated Attention (GA)
Vision and Language Navigation   Touchdown Dataset  Task Completion (TC)   11.8    RConcat

Related Papers

Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
Transferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks (2025-07-14)
AnyI2V: Animating Any Conditional Image with Motion Control (2025-07-03)
Hita: Holistic Tokenizer for Autoregressive Image Generation (2025-07-03)
NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments (2025-06-30)
SA-LUT: Spatial Adaptive 4D Look-Up Table for Photorealistic Style Transfer (2025-06-16)
Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding (2025-06-12)
Fine-Grained Control over Music Generation with Activation Steering (2025-06-11)