Visual Question Answering (VQA) on DocVQA test
Metric: ANLS (higher is better)
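
ANLS is commonly expanded as Average Normalized Levenshtein Similarity. The page does not define it, so the following is only a minimal sketch of the usual definition. It assumes a threshold of 0.5 on the normalized edit distance, case-insensitive matching after stripping whitespace, and several accepted answers per question. The function names `levenshtein` and `anls` are illustrative and are not taken from any official evaluation script, which may normalize answers differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average, over questions, of the best similarity to any accepted answer.

    predictions: list of predicted answer strings.
    references:  list of lists of accepted answer strings, one list per question.
    threshold:   assumed cutoff; normalized distances at or above it score 0.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()  # normalization is an assumption
        best = 0.0
        for ans in answers:
            g = ans.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Under these assumptions an exact match scores 1, a near miss such as a single-character OCR error receives partial credit, and an answer that differs in half or more of its characters scores 0. This is why the scores in the table fall between 0 and 1.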
Results
| # | Model | ANLS | Extra Data | Paper | Date | Code |
|---|-------|------|------------|-------|------|------|
| 1 | Human | 0.9436 | Yes | DocVQA: A Dataset for VQA on Document Images | 2020-07-01 | Code |
| 2 | MLCD-Embodied-7B | 0.916 | Yes | Multi-label Cluster Discrimination for Visual Re... | 2024-07-24 | Code |
| 3 | SMoLA-PaLI-X Specialist | 0.908 | Yes | Omni-SMoLA: Boosting Generalist Multimodal Model... | 2023-12-01 | - |
| 4 | SMoLA-PaLI-X Generalist | 0.906 | Yes | Omni-SMoLA: Boosting Generalist Multimodal Model... | 2023-12-01 | - |
| 5 | Qwen-VL-Plus | 0.9024 | Yes | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |
| 6 | ScreenAI 5B (4.62 B params, w/OCR) | 0.8988 | Yes | ScreenAI: A Vision-Language Model for UI and Inf... | 2024-02-07 | Code |
| 7 | PaLI-3 (w/ OCR) | 0.886 | No | PaLI-3 Vision Language Models: Smaller, Faster, ... | 2023-10-13 | Code |
| 8 | ERNIE-Layout large (ensemble) | 0.8841 | No | ERNIE-Layout: Layout Knowledge Enhanced Pre-trai... | 2022-10-12 | Code |
| 9 | GPT-4 | 0.884 | No | Layout and Task Aware Instruction Prompt for Zer... | 2023-06-01 | Code |
| 10 | DocFormerv2-large | 0.8784 | Yes | DocFormerv2: Local Features for Document Underst... | 2023-06-02 | Code |
| 11 | UDOP (aux) | 0.878 | Yes | Unifying Vision, Text, and Layout for Universal ... | 2022-12-05 | Code |
| 12 | PaLI-3 | 0.876 | No | PaLI-3 Vision Language Models: Smaller, Faster, ... | 2023-10-13 | Code |
| 13 | TILT-Large | 0.8705 | Yes | Going Full-TILT Boogie on Document Understanding... | 2021-02-18 | Code |
| 14 | PaLI-X (Single-task FT w/ OCR) | 0.868 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 15 | LayoutLMv2LARGE | 0.8672 | No | LayoutLMv2: Multi-modal Pre-training for Visuall... | 2020-12-29 | Code |
| 16 | ERNIE-Layout large | 0.8486 | No | ERNIE-Layout: Layout Knowledge Enhanced Pre-trai... | 2022-10-12 | Code |
| 17 | UDOP | 0.847 | No | Unifying Vision, Text, and Layout for Universal ... | 2022-12-05 | Code |
| 18 | TILT-Base | 0.8392 | Yes | Going Full-TILT Boogie on Document Understanding... | 2021-02-18 | Code |
| 19 | Claude + LATIN-Prompt | 0.8336 | No | Layout and Task Aware Instruction Prompt for Zer... | 2023-06-01 | Code |
| 20 | GPT-3.5 + LATIN-Prompt | 0.8255 | No | Layout and Task Aware Instruction Prompt for Zer... | 2023-06-01 | Code |
| 21 | PaLI-X (Multi-task FT) | 0.809 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 22 | DUBLIN (variable resolution) | 0.803 | Yes | DUBLIN -- Document Understanding By Language-Ima... | 2023-05-23 | - |
| 23 | PaLI-X (Single-task FT) | 0.8 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 24 | DUBLIN | 0.782 | Yes | DUBLIN -- Document Understanding By Language-Ima... | 2023-05-23 | - |
| 25 | LayoutLMv2BASE | 0.7808 | No | LayoutLMv2: Multi-modal Pre-training for Visuall... | 2020-12-29 | Code |
| 26 | Pix2Struct-large | 0.766 | No | Pix2Struct: Screenshot Parsing as Pretraining fo... | 2022-10-07 | Code |
| 27 | MatCha | 0.742 | No | MatCha: Enhancing Visual Language Pretraining wi... | 2022-12-19 | Code |
| 28 | Pix2Struct-base | 0.721 | No | Pix2Struct: Screenshot Parsing as Pretraining fo... | 2022-10-07 | Code |
| 29 | Donut | 0.675 | No | OCR-free Document Understanding Transformer | 2021-11-30 | Code |
| 30 | BERT_LARGE_SQUAD_DOCVQA_FINETUNED_Baseline | 0.665 | Yes | DocVQA: A Dataset for VQA on Document Images | 2020-07-01 | Code |
| 31 | Qwen-VL | 0.651 | Yes | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |
| 32 | Dessurt | 0.632 | No | End-to-end Document Recognition and Understandin... | 2022-03-30 | Code |
| 33 | Qwen-VL-Chat | 0.626 | Yes | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |