Visual Question Answering (VQA) on InfographicVQA
Metric: ANLS (higher is better)
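ANLS is commonly expanded as Average Normalized Levenshtein Similarity. The sketch below shows one way such a score could be computed, under stated assumptions: the 0.5 cut-off, the lowercasing and whitespace stripping, and the function names `levenshtein` and `anls` are illustrative choices based on the usual definition of the metric, not taken from this page, and the benchmark's official evaluation script may normalize answers differently. The leaderboard values appear to be this score scaled to 0-100.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def anls(
    predictions: Sequence[str],
    references: Sequence[Sequence[str]],
    threshold: float = 0.5,  # assumed cut-off; not stated on this page
) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()  # assumed normalization
        best = 0.0
        for answer in answers:
            ref = answer.strip().lower()
            longest = max(len(pred), len(ref))
            distance = levenshtein(pred, ref) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    # Hypothetical predictions and reference answers, for illustration only.
    preds = ["paris", "pariz", "london"]
    refs = [["Paris"], ["paris"], ["paris"]]
    print(round(100 * anls(preds, refs), 1))
```

Taking the maximum over reference answers lets a question with several accepted answers be credited for the closest one, and the threshold keeps near-miss spellings from being treated the same as unrelated strings.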
Results
| # | Model | ANLS | Extra Data | Paper | Date | Code |
|---|-------|------|------------|-------|------|------|
| 1 | Gemini Ultra (pixel only) | 80.3 | No | Gemini: A Family of Highly Capable Multimodal Mo... | 2023-12-19 | Code |
| 2 | SMoLA-PaLI-X Specialist | 66.2 | Yes | Omni-SMoLA: Boosting Generalist Multimodal Model... | 2023-12-01 | - |
| 3 | ScreenAI 5B (4.62 B params, w/ OCR) | 65.9 | Yes | ScreenAI: A Vision-Language Model for UI and Inf... | 2024-02-07 | Code |
| 4 | SMoLA-PaLI-X Generalist | 65.6 | Yes | Omni-SMoLA: Boosting Generalist Multimodal Model... | 2023-12-01 | - |
| 5 | UDOP (aux) | 63 | Yes | Unifying Vision, Text, and Layout for Universal ... | 2022-12-05 | Code |
| 6 | PaLI-3 (w/ OCR) | 62.4 | No | PaLI-3 Vision Language Models: Smaller, Faster, ... | 2023-10-13 | Code |
| 7 | TILT-Large | 61.2 | Yes | Going Full-TILT Boogie on Document Understanding... | 2021-02-18 | Code |
| 8 | PaLI-3 | 57.8 | No | PaLI-3 Vision Language Models: Smaller, Faster, ... | 2023-10-13 | Code |
| 9 | ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat) | 54.9 | No | LAPDoc: Layout-Aware Prompting for Documents | 2024-02-15 | - |
| 10 | PaLI-X (Single-task FT w/ OCR) | 54.8 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 11 | Claude + LATIN-Prompt | 54.51 | No | Layout and Task Aware Instruction Prompt for Zer... | 2023-06-01 | Code |
| 12 | PaLI-X (Multi-task FT) | 50.7 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 13 | PaLI-X (Single-task FT) | 49.2 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 14 | GPT-3.5 + LATIN-Prompt | 48.98 | No | Layout and Task Aware Instruction Prompt for Zer... | 2023-06-01 | Code |
| 15 | DocFormerv2-large | 48.8 | Yes | DocFormerv2: Local Features for Document Underst... | 2023-06-02 | Code |
| 16 | UDOP | 47.4 | No | Unifying Vision, Text, and Layout for Universal ... | 2022-12-05 | Code |
| 17 | DUBLIN (variable resolution) | 42.6 | Yes | DUBLIN -- Document Understanding By Language-Ima... | 2023-05-23 | - |
| 18 | Pix2Struct-large | 40 | No | Pix2Struct: Screenshot Parsing as Pretraining fo... | 2022-10-07 | Code |
| 19 | Pix2Struct-base | 38.2 | No | Pix2Struct: Screenshot Parsing as Pretraining fo... | 2022-10-07 | Code |
| 20 | MatCha | 37.2 | No | MatCha: Enhancing Visual Language Pretraining wi... | 2022-12-19 | Code |
| 21 | DUBLIN | 36.82 | Yes | DUBLIN -- Document Understanding By Language-Ima... | 2023-05-23 | - |