Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

2024-11-26 · CVPR 2025 · Instruction Following · Natural Language Visual Grounding · Vision-Language-Action
Paper · PDF · Code (official)

Abstract

Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. Most existing agents are language-based and rely on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees), so they are limited in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop ShowUI, a vision-language-action model for the digital world, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational cost by formulating screenshots as a UI connected graph, adaptively identifying redundant patches and using this as the criterion for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies the diverse needs of GUI tasks, enabling effective management of visual-action history in navigation and the pairing of multi-turn query-action sequences per screenshot to improve training efficiency; (iii) small-scale, high-quality GUI instruction-following datasets built through careful data curation and a resampling strategy that addresses significant data-type imbalances. With these components, ShowUI, a lightweight 2B model trained on 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further removes 33% of redundant visual tokens during training and speeds up training by 1.4x. Navigation experiments on web (Mind2Web), mobile (AITW), and online (MiniWob) environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.
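
To make the UI-guided token selection idea concrete, the sketch below is a rough approximation (not the authors' released implementation) of how a "UI connected graph" over screenshot patches might be built and used to drop redundant visual tokens: neighbouring patches with near-identical mean colours are linked, and one representative token is kept per connected component. The `threshold` value and the helper names `ui_connected_components` and `select_token_indices` are illustrative assumptions.

```python
import numpy as np

def ui_connected_components(patches, threshold=8.0):
    """Group screenshot patches into components by colour similarity.

    patches: (H, W, C) array of per-patch mean RGB values.
    Adjacent patches whose mean colours differ by less than `threshold`
    are linked, approximating a UI connected graph over the screenshot.
    Returns an (H, W) array of component ids.
    """
    patches = patches.astype(np.float32)
    h, w, _ = patches.shape
    parent = list(range(h * w))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for y in range(h):
        for x in range(w):
            idx = y * w + x
            # Link to right and bottom neighbours when colours are close.
            if x + 1 < w and np.linalg.norm(patches[y, x] - patches[y, x + 1]) < threshold:
                union(idx, y * w + x + 1)
            if y + 1 < h and np.linalg.norm(patches[y, x] - patches[y + 1, x]) < threshold:
                union(idx, (y + 1) * w + x)

    return np.array([find(i) for i in range(h * w)]).reshape(h, w)

def select_token_indices(components):
    """Keep one representative token index per component and drop the rest
    as redundant; the kept indices could then gate which visual tokens are
    attended to inside self-attention blocks."""
    flat = components.ravel()
    _, keep = np.unique(flat, return_index=True)
    return np.sort(keep)

if __name__ == "__main__":
    # Toy example: a 4x4 grid of patches where the top half is uniform
    # background collapses to far fewer tokens than the 16 originals.
    demo = np.zeros((4, 4, 3))
    demo[2:] = np.random.randint(0, 255, size=(2, 4, 3))
    comps = ui_connected_components(demo)
    print(select_token_indices(comps))
```

In this toy setting, large uniform regions (e.g. background) collapse into a single kept token, which is the intuition behind the reported 33% reduction in redundant visual tokens; the actual selection criterion used by ShowUI is defined in the paper and the official code.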

Results

Task                               Dataset     Metric        Value  Model
Natural Language Visual Grounding  ScreenSpot  Accuracy (%)  75.1   ShowUI
Natural Language Visual Grounding  ScreenSpot  Accuracy (%)  75     ShowUI-G

Related Papers

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation (2025-07-17)
How Many Instructions Can LLMs Follow at Once? (2025-07-15)
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering (2025-07-15)
Vision Language Action Models in Robotic Manipulation: A Systematic Review (2025-07-14)
Multilingual Multimodal Software Developer for Code Generation (2025-07-11)
TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data (2025-07-08)