Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Visual Question Answering (VQA) on InfographicVQA

Metric: ANLS (higher is better)
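ANLS stands for Average Normalized Levenshtein Similarity. The sketch below shows one way to compute it, assuming the usual definition with a threshold of 0.5: each prediction is scored against every accepted answer, the best score is kept, and any score whose normalized edit distance reaches the threshold is set to zero. The function names `levenshtein` and `anls`, and the lowercase/strip normalization, are illustrative assumptions and not the benchmark's official evaluation script. The leaderboard values appear to be this score multiplied by 100.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity.

    Assumption: answers are compared after lowercasing and stripping
    surrounding whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            gold = answer.strip().lower()
            longest = max(len(pred), len(gold))
            distance = levenshtein(pred, gold) / longest if longest else 0.0
            # Similarity counts only when the normalized distance is below the threshold.
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

For instance, `anls(["pariz"], [["Paris"]])` gives partial credit for a one-character error, while an unrelated answer such as `"london"` scores zero.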

Results

| # | Model | ANLS | Extra Data | Paper | Date | Code |
|---|-------|------|------------|-------|------|------|
| 1 | Gemini Ultra (pixel only) | 80.3 | No | Gemini: A Family of Highly Capable Multimodal Mo... | 2023-12-19 | Code |
| 2 | SMoLA-PaLI-X Specialist | 66.2 | Yes | Omni-SMoLA: Boosting Generalist Multimodal Model... | 2023-12-01 | - |
| 3 | ScreenAI 5B (4.62 B params, w/ OCR) | 65.9 | Yes | ScreenAI: A Vision-Language Model for UI and Inf... | 2024-02-07 | Code |
| 4 | SMoLA-PaLI-X Generalist | 65.6 | Yes | Omni-SMoLA: Boosting Generalist Multimodal Model... | 2023-12-01 | - |
| 5 | UDOP (aux) | 63 | Yes | Unifying Vision, Text, and Layout for Universal ... | 2022-12-05 | Code |
| 6 | PaLI-3 (w/ OCR) | 62.4 | No | PaLI-3 Vision Language Models: Smaller, Faster, ... | 2023-10-13 | Code |
| 7 | TILT-Large | 61.2 | Yes | Going Full-TILT Boogie on Document Understanding... | 2021-02-18 | Code |
| 8 | PaLI-3 | 57.8 | No | PaLI-3 Vision Language Models: Smaller, Faster, ... | 2023-10-13 | Code |
| 9 | ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat) | 54.9 | No | LAPDoc: Layout-Aware Prompting for Documents | 2024-02-15 | - |
| 10 | PaLI-X (Single-task FT w/ OCR) | 54.8 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 11 | Claude + LATIN-Prompt | 54.51 | No | Layout and Task Aware Instruction Prompt for Zer... | 2023-06-01 | Code |
| 12 | PaLI-X (Multi-task FT) | 50.7 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 13 | PaLI-X (Single-task FT) | 49.2 | Yes | PaLI-X: On Scaling up a Multilingual Vision and ... | 2023-05-29 | Code |
| 14 | GPT-3.5 + LATIN-Prompt | 48.98 | No | Layout and Task Aware Instruction Prompt for Zer... | 2023-06-01 | Code |
| 15 | DocFormerv2-large | 48.8 | Yes | DocFormerv2: Local Features for Document Underst... | 2023-06-02 | Code |
| 16 | UDOP | 47.4 | No | Unifying Vision, Text, and Layout for Universal ... | 2022-12-05 | Code |
| 17 | DUBLIN (variable resolution) | 42.6 | Yes | DUBLIN -- Document Understanding By Language-Ima... | 2023-05-23 | - |
| 18 | Pix2Struct-large | 40 | No | Pix2Struct: Screenshot Parsing as Pretraining fo... | 2022-10-07 | Code |
| 19 | Pix2Struct-base | 38.2 | No | Pix2Struct: Screenshot Parsing as Pretraining fo... | 2022-10-07 | Code |
| 20 | MatCha | 37.2 | No | MatCha: Enhancing Visual Language Pretraining wi... | 2022-12-19 | Code |
| 21 | DUBLIN | 36.82 | Yes | DUBLIN -- Document Understanding By Language-Ima... | 2023-05-23 | - |