Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering

Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos

2022-12-19
Tasks: Math, Chart Question Answering, Image to text, Data Summarization, Visual Question Answering (VQA), Language Modelling, Visual Question Answering

Abstract

Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models' capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning which are the key capabilities in visual language modeling. We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | DocVQA test | ANLS | 0.742 | MatCha |
| Visual Question Answering (VQA) | InfographicVQA | ANLS | 37.2 | MatCha |
| Visual Question Answering (VQA) | PlotQA-D2 | 1:1 Accuracy | 90.7 | MatCha |
| Visual Question Answering (VQA) | PlotQA-D1 | 1:1 Accuracy | 92.3 | MatCha |
| Visual Question Answering (VQA) | PlotQA | 1:1 Accuracy | 91.5 | MatCha |
| Visual Question Answering (VQA) | RealCQA | 1:1 Accuracy | 0.259728175283818 | Matcha-chartQA |
| Visual Question Answering (VQA) | ChartQA | 1:1 Accuracy | 64.2 | MatCha |
| Chart Question Answering | PlotQA | 1:1 Accuracy | 91.5 | MatCha |
| Chart Question Answering | RealCQA | 1:1 Accuracy | 0.259728175283818 | Matcha-chartQA |
| Chart Question Answering | ChartQA | 1:1 Accuracy | 64.2 | MatCha |
| Visual Question Answering | PlotQA-D2 | 1:1 Accuracy | 90.7 | MatCha |
| Visual Question Answering | PlotQA-D1 | 1:1 Accuracy | 92.3 | MatCha |
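
For readers unfamiliar with the ANLS metric listed for DocVQA and InfographicVQA, the following is a minimal sketch of how Average Normalized Levenshtein Similarity is commonly computed. This page does not state how the reported values were produced, so the 0.5 threshold, the lowercase/strip normalization, and the function names `levenshtein` and `anls` are assumptions based on the usual definition of the metric, not the evaluation code behind these results. The two ANLS entries also appear to be on different scales (0.742 versus 37.2); the sketch returns a score between 0 and 1.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def anls(
    predictions: Sequence[str],
    references: Sequence[List[str]],
    threshold: float = 0.5,  # assumed: the commonly used cut-off
) -> float:
    """Mean over questions of the best per-reference similarity.

    Similarity is 1 - normalized edit distance, or 0 when the
    normalized distance reaches the threshold.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            distance = levenshtein(p, r) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    # Hypothetical predictions and references, for illustration only.
    preds = ["hello", "xyz"]
    refs = [["hello"], ["hello", "world"]]
    print(anls(preds, refs))
```

Because each question takes the maximum over its acceptable answers, a prediction only needs to be close to one reference to earn credit, and answers that are mostly wrong contribute zero rather than a small partial score.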
