Visual Question Answering (VQA) on InfographicVQA

Metric: ANLS (Average Normalized Levenshtein Similarity; higher is better)
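
For readers unfamiliar with the metric, the following is a minimal sketch of how ANLS is commonly computed, not the official evaluation script. It assumes the usual threshold of 0.5, case-insensitive comparison after stripping whitespace, and several accepted answers per question; the function names `levenshtein` and `anls` are illustrative. The sketch returns a value in [0, 1], whereas the scores on this page appear to be that value expressed as a percentage.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    predictions[i] is the model answer to question i and references[i]
    lists the accepted answers. The threshold (0.5 here, an assumption)
    zeroes out answers that are too far from every accepted answer.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)  # keep the closest accepted answer
        total += best
    return total / len(predictions)
```

Taking the maximum over accepted answers means a prediction only needs to match one of them, and the threshold keeps near-misses (for example OCR slips) partially credited while unrelated answers score zero.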

Results

#	Model	ANLS	Extra Data	SOTA	Paper	Date	Code
1	Gemini Ultra (pixel only)	80.3	No	Yes	Gemini: A Family of Highly Capable Multimodal Models	2023-12-19	Yes
2	SMoLA-PaLI-X Specialist	66.2	Yes	Yes	Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts	2023-12-01	No
3	ScreenAI 5B (4.62 B params, w/ OCR)	65.9	Yes	No	ScreenAI: A Vision-Language Model for UI and Infographics Understanding	2024-02-07	Yes
4	SMoLA-PaLI-X Generalist	65.6	Yes	No	Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts	2023-12-01	No
5	UDOP (aux)	63	Yes	Yes	Unifying Vision, Text, and Layout for Universal Document Processing	2022-12-05	Yes
6	PaLI-3 (w/ OCR)	62.4	No	No	PaLI-3 Vision Language Models: Smaller, Faster, Stronger	2023-10-13	Yes
7	TILT-Large	61.2	Yes	Yes	Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer	2021-02-18	Yes
8	PaLI-3	57.8	No	No	PaLI-3 Vision Language Models: Smaller, Faster, Stronger	2023-10-13	Yes
9	ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat)	54.9	No	No	LAPDoc: Layout-Aware Prompting for Documents	2024-02-15	No
10	PaLI-X (Single-task FT w/ OCR)	54.8	Yes	No	PaLI-X: On Scaling up a Multilingual Vision and Language Model	2023-05-29	Yes
11	Claude + LATIN-Prompt	54.51	No	No	Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering	2023-06-01	Yes
12	PaLI-X (Multi-task FT)	50.7	Yes	No	PaLI-X: On Scaling up a Multilingual Vision and Language Model	2023-05-29	Yes
13	PaLI-X (Single-task FT)	49.2	Yes	No	PaLI-X: On Scaling up a Multilingual Vision and Language Model	2023-05-29	Yes
14	GPT-3.5 + LATIN-Prompt	48.98	No	No	Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering	2023-06-01	Yes
15	DocFormerv2-large	48.8	Yes	No	DocFormerv2: Local Features for Document Understanding	2023-06-02	Yes
16	UDOP	47.4	No	No	Unifying Vision, Text, and Layout for Universal Document Processing	2022-12-05	Yes
17	DUBLIN (variable resolution)	42.6	Yes	No	DUBLIN -- Document Understanding By Language-Image Network	2023-05-23	No
18	Pix2Struct-large	40	No	No	Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding	2022-10-07	Yes
19	Pix2Struct-base	38.2	No	No	Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding	2022-10-07	Yes
20	MatCha	37.2	No	No	MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering	2022-12-19	Yes
21	DUBLIN	36.82	Yes	No	DUBLIN -- Document Understanding By Language-Image Network	2023-05-23	No