Natural Language Visual Grounding on ScreenSpot
Metric: Accuracy (%) (higher is better)
Results
| # | Model | Accuracy (%) | Extra Data | SOTA | Paper | Date |
|---|-------|--------------|------------|------|-------|------|
| 1 | UGround-V1-7B | 86.34 | No | SOTA | Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | 2024-10-07 |
| 2 | Aguvis-7B | 83 | No | | Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction | 2024-12-05 |
| 3 | OS-Atlas-Base-7B | 82.47 | No | | OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | 2024-10-30 |
| 4 | Aria-UI | 81.1 | No | | Aria-UI: Visual Grounding for GUI Instructions | 2024-12-20 |
| 5 | Aguvis-G-7B | 81 | No | | Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction | 2024-12-05 |
| 6 | UGround-V1-2B | 77.67 | No | | Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | 2024-10-07 |
| 7 | ShowUI | 75.1 | No | | ShowUI: One Vision-Language-Action Model for GUI Visual Agent | 2024-11-26 |
| 8 | ShowUI-G | 75 | No | | ShowUI: One Vision-Language-Action Model for GUI Visual Agent | 2024-11-26 |
| 9 | UGround | 73.3 | No | | Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | 2024-10-07 |
| 10 | OmniParser | 73 | No | SOTA | OmniParser for Pure Vision Based GUI Agent | 2024-08-01 |
| 11 | OS-Atlas-Base-4B | 68 | No | | OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | 2024-10-30 |
| 12 | SeeClick | 53.4 | No | SOTA | SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents | 2024-01-17 |
| 13 | CogAgent | 47.4 | No | SOTA | CogAgent: A Visual Language Model for GUI Agents | 2023-12-14 |
| 14 | Qwen2-VL-7B | 42.1 | No | | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | 2024-09-18 |
| 15 | Qwen-GUI | 28.6 | No | | GUICourse: From General Vision Language Models to Versatile GUI Agents | 2024-06-17 |
| 16 | MiniGPT-v2 | 5.7 | No | SOTA | MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | 2023-10-14 |
| 17 | Groma | 5.2 | No | | Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | 2024-04-19 |
| 18 | Qwen-VL | 5.2 | No | SOTA | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | 2023-08-24 |
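The page does not say how the SOTA marker is assigned. One reading that fits the listed results is that an entry is marked when its accuracy beats every entry with an earlier date. The sketch below tests that reading against the rows above. The function name `running_best` and the tie-break rule for entries sharing a date (higher accuracy first) are assumptions made for illustration, not something the leaderboard documents.

```python
from datetime import date

# (model, accuracy %, date) copied from the leaderboard rows above.
RESULTS = [
    ("UGround-V1-7B", 86.34, date(2024, 10, 7)),
    ("Aguvis-7B", 83.0, date(2024, 12, 5)),
    ("OS-Atlas-Base-7B", 82.47, date(2024, 10, 30)),
    ("Aria-UI", 81.1, date(2024, 12, 20)),
    ("Aguvis-G-7B", 81.0, date(2024, 12, 5)),
    ("UGround-V1-2B", 77.67, date(2024, 10, 7)),
    ("ShowUI", 75.1, date(2024, 11, 26)),
    ("ShowUI-G", 75.0, date(2024, 11, 26)),
    ("UGround", 73.3, date(2024, 10, 7)),
    ("OmniParser", 73.0, date(2024, 8, 1)),
    ("OS-Atlas-Base-4B", 68.0, date(2024, 10, 30)),
    ("SeeClick", 53.4, date(2024, 1, 17)),
    ("CogAgent", 47.4, date(2023, 12, 14)),
    ("Qwen2-VL-7B", 42.1, date(2024, 9, 18)),
    ("Qwen-GUI", 28.6, date(2024, 6, 17)),
    ("MiniGPT-v2", 5.7, date(2023, 10, 14)),
    ("Groma", 5.2, date(2024, 4, 19)),
    ("Qwen-VL", 5.2, date(2023, 8, 24)),
]


def running_best(results):
    """Return models that strictly beat every earlier-dated entry (hypothetical helper)."""
    best = float("-inf")
    record_setters = []
    # Oldest first; for entries sharing a date, try the higher accuracy first
    # (assumed tie-break, chosen because it matches the markers on the page).
    for model, accuracy, _day in sorted(results, key=lambda r: (r[2], -r[1])):
        if accuracy > best:
            best = accuracy
            record_setters.append(model)
    return record_setters


if __name__ == "__main__":
    for name in running_best(RESULTS):
        print(name)
```

Under these assumptions the function returns Qwen-VL, MiniGPT-v2, CogAgent, SeeClick, OmniParser, and UGround-V1-7B, the same six models that carry the SOTA marker in the results.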