
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao

2024-10-30 · Natural Language Visual Grounding
Paper · PDF · Code (official)

Abstract

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.
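
Since OS-Atlas-Base is built on a Qwen2-VL backbone, a grounding query can likely be run with the standard Hugging Face transformers classes for that family. The sketch below is a minimal, unofficial example: the repository id, prompt wording, and the expectation of a coordinate-style answer are assumptions, not taken from the paper.

```python
# Minimal, unofficial sketch of a GUI grounding query. Assumptions: the Hugging
# Face repo id below, the prompt wording, and that the checkpoint loads with
# the standard Qwen2-VL classes (OS-Atlas-Base is Qwen2-VL-based).
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "OS-Copilot/OS-Atlas-Base-7B"  # assumed repo id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

screenshot = Image.open("screenshot.png")  # any GUI screenshot
instruction = "the search button in the top toolbar"

# Pair the screenshot with a natural-language grounding instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": f"Locate the element described by: {instruction}"},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

# The model is expected to answer with element coordinates (a point or box);
# decode only the newly generated tokens.
output_ids = model.generate(**inputs, max_new_tokens=64)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```

In practice, the exact prompt template should be taken from the official repository, since grounding models are sensitive to the instruction format they were trained on.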

Results

Task                                Dataset      Metric         Value   Model
Natural Language Visual Grounding   ScreenSpot   Accuracy (%)   82.47   OS-Atlas-Base-7B
Natural Language Visual Grounding   ScreenSpot   Accuracy (%)   68      OS-Atlas-Base-4B
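
For reference, grounding accuracy on benchmarks like ScreenSpot is typically computed by checking whether the predicted click point falls inside the target element's ground-truth bounding box. A minimal sketch of that check, with illustrative toy data rather than the paper's results:

```python
# Illustrative only: toy predictions, not the paper's evaluation data.
def grounding_hit(pred_xy, gt_box):
    """A prediction counts as correct when the predicted click point lies
    inside the target element's ground-truth box (left, top, right, bottom)."""
    x, y = pred_xy
    x1, y1, x2, y2 = gt_box
    return x1 <= x <= x2 and y1 <= y <= y2

preds = [((120, 44), (100, 30, 180, 60)),    # hit
         ((400, 300), (90, 280, 200, 340))]  # miss: x falls outside the box
accuracy = 100 * sum(grounding_hit(p, b) for p, b in preds) / len(preds)
print(f"Accuracy (%): {accuracy:.2f}")  # -> 50.00
```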

Related Papers

Aria-UI: Visual Grounding for GUI Instructions (2024-12-20)
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (2024-12-05)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent (2024-11-26)
Improved GUI Grounding via Iterative Narrowing (2024-11-18)
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (2024-10-07)
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (2024-09-18)
OmniParser for Pure Vision Based GUI Agent (2024-08-01)
GUICourse: From General Vision Language Models to Versatile GUI Agents (2024-06-17)