Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

Xing Han Lù, Zdeněk Kasner, Siva Reddy

2024-02-08 | Text Generation | Conversational Web Navigation | Vision and Language Navigation

Abstract

We propose the problem of conversational web navigation, where a digital agent controls a web browser and follows user instructions to solve real-world tasks in a multi-turn dialogue fashion. To support this problem, we introduce WEBLINX - a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. Our benchmark covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios. Due to the magnitude of information present, Large Language Models (LLMs) cannot process entire web pages in real-time. To solve this bottleneck, we design a retrieval-inspired model that efficiently prunes HTML pages by ranking relevant elements. We use the selected elements, along with screenshots and action history, to assess a variety of models for their ability to replicate human behavior when navigating the web. Our experiments span from small text-only to proprietary multimodal LLMs. We find that smaller finetuned decoders surpass the best zero-shot LLMs (including GPT-4V), but also larger finetuned multimodal models which were explicitly pretrained on screenshots. However, all finetuned models struggle to generalize to unseen websites. Our findings highlight the need for large multimodal models that can generalize to novel settings. Our code, data and models are available for research: https://mcgill-nlp.github.io/weblinx
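
The abstract describes pruning HTML pages by ranking relevant elements and keeping only the selected ones. The sketch below illustrates that general idea under strong assumptions: it scores candidate elements against a query with bag-of-words cosine similarity, which is a simple stand-in and not the retrieval model from the paper. The function names (`tokenize`, `cosine`, `rank_elements`), the `top_k` parameter, and the sample elements are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_elements(query, elements, top_k=2):
    """Return the top_k elements most similar to the query.

    Assumption: similarity is bag-of-words cosine; a learned ranker
    would replace this scoring step.
    """
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(e))), i, e) for i, e in enumerate(elements)
    ]
    # Highest score first; ties keep the original page order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [e for _, _, e in scored[:top_k]]


# Hypothetical candidate elements extracted from a page.
elements = [
    '<a href="/about">About us</a>',
    '<button id="search-btn">Search flights</button>',
    '<input name="destination" placeholder="Destination city">',
    '<footer>Copyright notice</footer>',
]
kept = rank_elements("search flights to destination city", elements, top_k=2)
for element in kept:
    print(element)
```

Only the elements in `kept` would then be passed to the model, which keeps the input short regardless of how large the full page is.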

Results

Task: Conversational Web Navigation. Dataset: WebLINX.

Model                  Element (IoU)  Intent Match  Text (F1)  Overall score
Llama-2-13B            22.82          81.91         26.6       25.21
S-LLaMA-2.7B           22.6           84            27.17      25.02
Llama-2-7B             22.26          82.64         26.5       24.57
Flan-T5-3B             20.31          81.14         25.75      23.77
S-LLaMA-1.3B           20.54          83.32         25.85      23.73
GPT-3.5F               18.64          77.56         22.39      21.22
MindAct-3B             16.5           79.89         23.16      20.94
Fuyu-8B                15.7           80.07         22.3       19.97
Flan-T5-780M           15.36          80.02         14.05      17.27
Pix2Act-1.3B           8.28           81.8          25.21      16.88
MindAct-780M           13.39          75.87         13.58      15.13
Flan-T5-250M           14.86          79.69         9.21       14.99
MindAct-250M           12.05          74.25         7.67       12.63
Pix2Act-282M           6.2            79.71         16.4       12.51
GPT-4T (Zero-Shot)     10.85          41.66         6.75       10.72
GPT-4V (Zero-Shot)     10.91          42.36         6.21       10.45
GPT-3.5T (Zero-Shot)   8.62           42.77         3.45       8.51
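
This listing does not define the Element (IoU) and Text (F1) metrics. For readers unfamiliar with the names, the sketch below shows the generic textbook forms: intersection over union of two bounding boxes, and token-overlap F1 between a predicted and a reference string. Whether the benchmark computes them exactly this way is an assumption; the function names and the `(left, top, right, bottom)` box layout are hypothetical.

```python
from collections import Counter


def box_iou(a, b):
    """Intersection over union of two boxes given as (left, top, right, bottom)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def token_f1(pred, ref):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    p, r = pred.split(), ref.split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


# Two 10x10 boxes that overlap on half their width.
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
# Two of three predicted tokens appear in the four-token reference.
f1 = token_f1("book a flight", "book flight to Paris")
print(round(iou, 4), round(f1, 4))
```

Both scores lie between 0 and 1; the table reports values on a 0 to 100 scale.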
