Visual Question Answering (VQA) on DocVQA test

Metric: ANLS (higher is better)
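
For readers unfamiliar with the metric, here is a minimal sketch of how ANLS (Average Normalized Levenshtein Similarity) is commonly computed. For each question, the prediction is scored against every accepted answer as 1 minus the normalized edit distance, scores whose normalized distance reaches a threshold are set to 0, the best score per question is kept, and the results are averaged. The function names, the lowercase-and-strip normalization, and the default threshold of 0.5 are assumptions made for illustration; the official DocVQA evaluation may differ in details.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity.

    Assumption: answers are compared after strip() and lower().
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            gold = answer.strip().lower()
            longest = max(len(pred), len(gold))
            distance = levenshtein(pred, gold) / longest if longest else 0.0
            # Similarities at or beyond the threshold distance count as 0.
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

With this sketch, a prediction that differs from an accepted answer by one character out of ten scores 0.9 for that question, while a prediction with no characters in common scores 0.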

Results

#	Model	ANLS	Extra Data	Paper	Date	Code
1	Human	0.9436	Yes	DocVQA: A Dataset for VQA on Document Images	2020-07-01	Code
2	MLCD-Embodied-7B	0.916	Yes	Multi-label Cluster Discrimination for Visual Representation Learning	2024-07-24	Code
3	SMoLA-PaLI-X Specialist	0.908	Yes	Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts	2023-12-01	-
4	SMoLA-PaLI-X Generalist	0.906	Yes	Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts	2023-12-01	-
5	Qwen-VL-Plus	0.9024	Yes	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	2023-08-24	Code
6	ScreenAI 5B (4.62 B params, w/OCR)	0.8988	Yes	ScreenAI: A Vision-Language Model for UI and Infographics Understanding	2024-02-07	Code
7	PaLI-3 (w/ OCR)	0.886	No	PaLI-3 Vision Language Models: Smaller, Faster, Stronger	2023-10-13	Code
8	ERNIE-Layout large (ensemble)	0.8841	No	ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding	2022-10-12	Code
9	GPT-4	0.884	No	Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering	2023-06-01	Code
10	DocFormerv2-large	0.8784	Yes	DocFormerv2: Local Features for Document Understanding	2023-06-02	Code
11	UDOP (aux)	0.878	Yes	Unifying Vision, Text, and Layout for Universal Document Processing	2022-12-05	Code
12	PaLI-3	0.876	No	PaLI-3 Vision Language Models: Smaller, Faster, Stronger	2023-10-13	Code
13	TILT-Large	0.8705	Yes	Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer	2021-02-18	Code
14	PaLI-X (Single-task FT w/ OCR)	0.868	Yes	PaLI-X: On Scaling up a Multilingual Vision and Language Model	2023-05-29	Code
15	LayoutLMv2 LARGE	0.8672	No	LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding	2020-12-29	Code
16	ERNIE-Layout large	0.8486	No	ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding	2022-10-12	Code
17	UDOP	0.847	No	Unifying Vision, Text, and Layout for Universal Document Processing	2022-12-05	Code
18	TILT-Base	0.8392	Yes	Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer	2021-02-18	Code
19	Claude + LATIN-Prompt	0.8336	No	Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering	2023-06-01	Code
20	GPT-3.5 + LATIN-Prompt	0.8255	No	Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering	2023-06-01	Code
21	PaLI-X (Multi-task FT)	0.809	Yes	PaLI-X: On Scaling up a Multilingual Vision and Language Model	2023-05-29	Code
22	DUBLIN (variable resolution)	0.803	Yes	DUBLIN -- Document Understanding By Language-Image Network	2023-05-23	-
23	PaLI-X (Single-task FT)	0.8	Yes	PaLI-X: On Scaling up a Multilingual Vision and Language Model	2023-05-29	Code
24	DUBLIN	0.782	Yes	DUBLIN -- Document Understanding By Language-Image Network	2023-05-23	-
25	LayoutLMv2 BASE	0.7808	No	LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding	2020-12-29	Code
26	Pix2Struct-large	0.766	No	Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding	2022-10-07	Code
27	MatCha	0.742	No	MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering	2022-12-19	Code
28	Pix2Struct-base	0.721	No	Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding	2022-10-07	Code
29	Donut	0.675	No	OCR-free Document Understanding Transformer	2021-11-30	Code
30	BERT_LARGE_SQUAD_DOCVQA_FINETUNED_Baseline	0.665	Yes	DocVQA: A Dataset for VQA on Document Images	2020-07-01	Code
31	Qwen-VL	0.651	Yes	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	2023-08-24	Code
32	Dessurt	0.632	No	End-to-end Document Recognition and Understanding with Dessurt	2022-03-30	Code
33	Qwen-VL-Chat	0.626	Yes	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	2023-08-24	Code
