Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Visual Question Answering (VQA) on DocVQA test

Metric: ANLS (higher is better)
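
ANLS stands for Average Normalized Levenshtein Similarity. The sketch below shows how the metric is commonly computed, assuming the conventional threshold of 0.5 and case-insensitive, whitespace-trimmed comparison; the exact evaluation script behind this leaderboard is not shown here, so the function names and normalization choices are illustrative assumptions rather than the official implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best per-answer similarity.

    predictions: list of predicted answer strings.
    references:  list of lists of accepted answers, one list per question.
    threshold:   assumed cutoff; normalized distances at or above it score 0.
    """
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions) if predictions else 0.0
```

Under these assumptions an exact match scores 1, a near miss such as a one-character OCR slip earns partial credit, and an answer whose normalized edit distance reaches the threshold scores 0. This is why ANLS is more forgiving of small transcription errors than exact-match accuracy.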

Results

| # | Model | ANLS | Extra Data | Paper | Date | Code |
|---|-------|------|------------|-------|------|------|
| 1 | Human | 0.9436 | Yes | DocVQA: A Dataset for VQA on Document Images | 2020-07-01 | Code |
| 2 | MLCD-Embodied-7B | 0.916 | Yes | Multi-label Cluster Discrimination for Visual Re... | 2024-07-24 | Code |
| 3 | SMoLA-PaLI-X Specialist | 0.908 | Yes | Omni-SMoLA: Boosting Generalist Multimodal Model... | 2023-12-01 | - |
| 4 | SMoLA-PaLI-X Generalist | 0.906 | Yes | Omni-SMoLA: Boosting Generalist Multimodal Model... | 2023-12-01 | - |
| 5 | Qwen-VL-Plus | 0.9024 | Yes | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |
| 6 | ScreenAI 5B (4.62 B params, w/OCR) | 0.8988 | Yes | ScreenAI: A Vision-Language Model for UI and Inf... | 2024-02-07 | Code |
| 7 | PaLI-3 (w/ OCR) | 0.886 | No | PaLI-3 Vision Language Models: Smaller, Faster, ... | 2023-10-13 | Code |
| 8 | ERNIE-Layout large (ensemble) | 0.8841 | No | ERNIE-Layout: Layout Knowledge Enhanced Pre-trai... | 2022-10-12 | Code |
| 9 | GPT-4 | 0.884 | No | Layout and Task Aware Instruction Prompt for Zer... | 2023-06-01 | Code |
| 10 | DocFormerv2-large | 0.8784 | Yes | DocFormerv2: Local Features for Document Underst... | 2023-06-02 | Code |
| 11 | UDOP (aux) | 0.878 | Yes | Unifying Vision, Text, and Layout for Universal ... | 2022-12-05 | Code |
| 12 | PaLI-3 | 0.876 | No | PaLI-3 Vision Language Models: Smaller, Faster, ... | 2023-10-13 | Code |
| 13 | TILT-Large | 0.8705 | Yes | Going Full-TILT Boogie on Document Understanding... | 2021-02-18 | Code |
| 14 | PaLI-X (Single-task FT w/ OCR) | 0.868 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 15 | LayoutLMv2LARGE | 0.8672 | No | LayoutLMv2: Multi-modal Pre-training for Visuall... | 2020-12-29 | Code |
| 16 | ERNIE-Layout large | 0.8486 | No | ERNIE-Layout: Layout Knowledge Enhanced Pre-trai... | 2022-10-12 | Code |
| 17 | UDOP | 0.847 | No | Unifying Vision, Text, and Layout for Universal ... | 2022-12-05 | Code |
| 18 | TILT-Base | 0.8392 | Yes | Going Full-TILT Boogie on Document Understanding... | 2021-02-18 | Code |
| 19 | Claude + LATIN-Prompt | 0.8336 | No | Layout and Task Aware Instruction Prompt for Zer... | 2023-06-01 | Code |
| 20 | GPT-3.5 + LATIN-Prompt | 0.8255 | No | Layout and Task Aware Instruction Prompt for Zer... | 2023-06-01 | Code |
| 21 | PaLI-X (Multi-task FT) | 0.809 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 22 | DUBLIN (variable resolution) | 0.803 | Yes | DUBLIN -- Document Understanding By Language-Ima... | 2023-05-23 | - |
| 23 | PaLI-X (Single-task FT) | 0.8 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 24 | DUBLIN | 0.782 | Yes | DUBLIN -- Document Understanding By Language-Ima... | 2023-05-23 | - |
| 25 | LayoutLMv2BASE | 0.7808 | No | LayoutLMv2: Multi-modal Pre-training for Visuall... | 2020-12-29 | Code |
| 26 | Pix2Struct-large | 0.766 | No | Pix2Struct: Screenshot Parsing as Pretraining fo... | 2022-10-07 | Code |
| 27 | MatCha | 0.742 | No | MatCha: Enhancing Visual Language Pretraining wi... | 2022-12-19 | Code |
| 28 | Pix2Struct-base | 0.721 | No | Pix2Struct: Screenshot Parsing as Pretraining fo... | 2022-10-07 | Code |
| 29 | Donut | 0.675 | No | OCR-free Document Understanding Transformer | 2021-11-30 | Code |
| 30 | BERT_LARGE_SQUAD_DOCVQA_FINETUNED_Baseline | 0.665 | Yes | DocVQA: A Dataset for VQA on Document Images | 2020-07-01 | Code |
| 31 | Qwen-VL | 0.651 | Yes | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |
| 32 | Dessurt | 0.632 | No | End-to-end Document Recognition and Understandin... | 2022-03-30 | Code |
| 33 | Qwen-VL-Chat | 0.626 | Yes | Qwen-VL: A Versatile Vision-Language Model for U... | 2023-08-24 | Code |