Natural Language Visual Grounding on ScreenSpot

Metric: Accuracy (%) (higher is better)
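
This page does not say how accuracy is computed. A common convention for GUI grounding benchmarks, assumed here rather than taken from this page, counts a prediction as correct when the predicted click point falls inside the target element's bounding box. The sketch below illustrates that assumption only; `point_in_box` and `grounding_accuracy` are hypothetical helper names, and the `(left, top, right, bottom)` box layout is likewise an assumption.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]              # (x, y) predicted click location
Box = Tuple[float, float, float, float]  # assumed layout: (left, top, right, bottom)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: Iterable[Point], targets: Iterable[Box]) -> float:
    """Percentage of predicted points that land inside their target box.

    Assumes predictions and targets are aligned one-to-one.
    """
    pairs = list(zip(predictions, targets))
    if not pairs:
        raise ValueError("at least one prediction/target pair is required")
    hits = sum(point_in_box(point, box) for point, box in pairs)
    return 100.0 * hits / len(pairs)
```

For example, with two predictions against the same 10x10 box, one inside and one outside, `grounding_accuracy([(5, 5), (50, 50)], [(0, 0, 10, 10), (0, 0, 10, 10)])` evaluates to 50.0. Individual papers may differ in details such as coordinate normalization, so scores are only comparable when the same protocol is used.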

Results

#	Model	Accuracy (%)	Extra Data	Paper	Date	Code
1	UGround-V1-7B	86.34	No	Navigating the Digital World as Humans Do: Unive...	2024-10-07	Code
2	Aguvis-7B	83	No	Aguvis: Unified Pure Vision Agents for Autonomou...	2024-12-05	Code
3	OS-Atlas-Base-7B	82.47	No	OS-ATLAS: A Foundation Action Model for Generali...	2024-10-30	Code
4	Aria-UI	81.1	No	Aria-UI: Visual Grounding for GUI Instructions	2024-12-20	Code
5	Aguvis-G-7B	81	No	Aguvis: Unified Pure Vision Agents for Autonomou...	2024-12-05	Code
6	UGround-V1-2B	77.67	No	Navigating the Digital World as Humans Do: Unive...	2024-10-07	Code
7	ShowUI	75.1	No	ShowUI: One Vision-Language-Action Model for GUI...	2024-11-26	Code
8	ShowUI-G	75	No	ShowUI: One Vision-Language-Action Model for GUI...	2024-11-26	Code
9	UGround	73.3	No	Navigating the Digital World as Humans Do: Unive...	2024-10-07	Code
10	OmniParser	73	No	OmniParser for Pure Vision Based GUI Agent	2024-08-01	Code
11	OS-Atlas-Base-4B	68	No	OS-ATLAS: A Foundation Action Model for Generali...	2024-10-30	Code
12	SeeClick	53.4	No	SeeClick: Harnessing GUI Grounding for Advanced ...	2024-01-17	Code
13	CogAgent	47.4	No	CogAgent: A Visual Language Model for GUI Agents	2023-12-14	Code
14	Qwen2-VL-7B	42.1	No	Qwen2-VL: Enhancing Vision-Language Model's Perc...	2024-09-18	Code
15	Qwen-GUI	28.6	No	GUICourse: From General Vision Language Models t...	2024-06-17	Code
16	MiniGPT-v2	5.7	No	MiniGPT-v2: large language model as a unified in...	2023-10-14	Code
17	Groma	5.2	No	Groma: Localized Visual Tokenization for Groundi...	2024-04-19	Code
18	Qwen-VL	5.2	No	Qwen-VL: A Versatile Vision-Language Model for U...	2023-08-24	Code

#1 UGround-V1-7B (SOTA)
86.34 Accuracy (%) · 2024-10-07
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents · Code

#2 Aguvis-7B
83 Accuracy (%) · 2024-12-05
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction · Code

#3 OS-Atlas-Base-7B
82.47 Accuracy (%) · 2024-10-30
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents · Code

#4 Aria-UI
81.1 Accuracy (%) · 2024-12-20
Aria-UI: Visual Grounding for GUI Instructions · Code

#5 Aguvis-G-7B
81 Accuracy (%) · 2024-12-05
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction · Code

#6 UGround-V1-2B
77.67 Accuracy (%) · 2024-10-07
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents · Code

#7 ShowUI
75.1 Accuracy (%) · 2024-11-26
ShowUI: One Vision-Language-Action Model for GUI Visual Agent · Code

#8 ShowUI-G
75 Accuracy (%) · 2024-11-26
ShowUI: One Vision-Language-Action Model for GUI Visual Agent · Code

#9 UGround
73.3 Accuracy (%) · 2024-10-07
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents · Code

#10 OmniParser (SOTA)
73 Accuracy (%) · 2024-08-01
OmniParser for Pure Vision Based GUI Agent · Code

#11 OS-Atlas-Base-4B
68 Accuracy (%) · 2024-10-30
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents · Code

#12 SeeClick (SOTA)
53.4 Accuracy (%) · 2024-01-17
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents · Code

#13 CogAgent (SOTA)
47.4 Accuracy (%) · 2023-12-14
CogAgent: A Visual Language Model for GUI Agents · Code

#14 Qwen2-VL-7B
42.1 Accuracy (%) · 2024-09-18
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution · Code

#15 Qwen-GUI
28.6 Accuracy (%) · 2024-06-17
GUICourse: From General Vision Language Models to Versatile GUI Agents · Code

#16 MiniGPT-v2 (SOTA)
5.7 Accuracy (%) · 2023-10-14
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning · Code

#17 Groma
5.2 Accuracy (%) · 2024-04-19
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models · Code

#18 Qwen-VL (SOTA)
5.2 Accuracy (%) · 2023-08-24
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond · Code