Dataset Name: SPIQA (Scientific Paper Image Question Answering)
Paper: SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
GitHub: SPIQA eval and metrics code repo
Dataset Summary: SPIQA is a large-scale and challenging QA dataset focused on figures, tables, and text paragraphs from scientific research papers in various computer science domains. The figures cover a wide variety of plots, charts, schematic diagrams, result visualizations, etc. The dataset is the result of a meticulous curation process that leverages the ability of multimodal large language models (MLLMs) to understand figures. We employ both automatic and manual curation to ensure a high level of quality and reliability. SPIQA consists of more than 270K questions divided into training, validation, and three different evaluation splits. The purpose of the dataset is to evaluate the ability of large multimodal models to comprehend complex figures and tables together with the textual paragraphs of scientific papers.
This Data Card describes the structure of the SPIQA dataset. The test-B and test-C splits are filtered from the QASA and QASPER datasets, respectively, and contain human-written QAs. The underlying papers were collected from arXiv and cover publications at top computer science conferences between 2018 and 2023.
If you have any comments or questions, reach out to Shraman Pramanick or Subhashini Venugopalan.
Supported Tasks: Question answering on scientific papers, grounded in figures, tables, and text paragraphs.
Language: English
Release Date: June 2024
The statistics of the different splits of SPIQA are shown below.
| Split  | Papers | Questions | Schematics | Plots & Charts | Visualizations | Other figures | Tables  |
|--------|--------|-----------|------------|----------------|----------------|---------------|---------|
| Train  | 25,459 | 262,524   | 44,008     | 70,041         | 27,297         | 6,450         | 114,728 |
| Val    | 200    | 2,085     | 360        | 582            | 173            | 55            | 915     |
| test-A | 118    | 666       | 154        | 301            | 131            | 95            | 434     |
| test-B | 65     | 228       | 147        | 156            | 133            | 17            | 341     |
| test-C | 314    | 493       | 415        | 404            | 26             | 66            | 1,332   |
The contents of the dataset are structured as follows:
SPIQA
├── SPIQA_train_val_test-A_extracted_paragraphs.zip
│   └── Extracted textual paragraphs from the papers in the SPIQA train, val, and test-A splits
├── SPIQA_train_val_test-A_raw_tex.zip
│   └── Raw TeX files from the papers in the SPIQA train, val, and test-A splits. These files are not required to reproduce our results; we open-source them for future research.
├── train_val
│   ├── SPIQA_train_val_Images.zip
│   │   └── Full-resolution figures and tables from the papers in the SPIQA train and val splits
│   ├── SPIQA_train.json
│   │   └── SPIQA train metadata
│   └── SPIQA_val.json
│       └── SPIQA val metadata
├── test-A
│   ├── SPIQA_testA_Images.zip
│   │   └── Full-resolution figures and tables from the papers in the SPIQA test-A split
│   ├── SPIQA_testA_Images_224px.zip
│   │   └── 224px figures and tables from the papers in the SPIQA test-A split
│   └── SPIQA_testA.json
│       └── SPIQA test-A metadata
├── test-B
│   ├── SPIQA_testB_Images.zip
│   │   └── Full-resolution figures and tables from the papers in the SPIQA test-B split
│   ├── SPIQA_testB_Images_224px.zip
│   │   └── 224px figures and tables from the papers in the SPIQA test-B split
│   └── SPIQA_testB.json
│       └── SPIQA test-B metadata
└── test-C
    ├── SPIQA_testC_Images.zip
    │   └── Full-resolution figures and tables from the papers in the SPIQA test-C split
    ├── SPIQA_testC_Images_224px.zip
    │   └── 224px figures and tables from the papers in the SPIQA test-C split
    └── SPIQA_testC.json
        └── SPIQA test-C metadata
The testA_data_viewer.json file is provided only for previewing a portion of the data in the Hugging Face dataset viewer, to give a quick sense of the metadata.
The metadata for every split is provided as a dictionary whose keys are the arXiv IDs of the papers. The primary contents of each dictionary item are illustrated in the access examples below.
We recommend that users download the metadata and images to their local machine. Downloading the whole dataset:

from huggingface_hub import snapshot_download

# Download the full dataset snapshot; point local_dir at your target directory
snapshot_download(repo_id="google/spiqa", repo_type="dataset", local_dir='.')
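Once the snapshot is downloaded, the image archives can be unpacked with Python's standard library. A minimal sketch, assuming the directory layout shown in the tree above:

import zipfile

# Unpack the test-A full-resolution images alongside the archive
with zipfile.ZipFile('test-A/SPIQA_testA_Images.zip') as zf:
    zf.extractall('test-A/')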
Downloading a specific file:

from huggingface_hub import hf_hub_download

# Download only the test-A metadata JSON; point local_dir at your target directory
hf_hub_download(repo_id="google/spiqa", filename="test-A/SPIQA_testA.json", repo_type="dataset", local_dir='.')
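To fetch all files belonging to a single split rather than one file at a time, snapshot_download also accepts an allow_patterns filter. A minimal sketch; the glob pattern is our assumption, based on the tree above:

from huggingface_hub import snapshot_download

# Download only the test-A metadata and image archives
snapshot_download(repo_id="google/spiqa", repo_type="dataset",
                  local_dir='.', allow_patterns=["test-A/*"])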
Accessing the test-A metadata:

import json

with open('test-A/SPIQA_testA.json', 'r') as f:
    testA_metadata = json.load(f)
paper_id = '1702.03584v3'
print(testA_metadata[paper_id]['qa'])  # All QAs for this paper
Accessing the test-B metadata:

import json

with open('test-B/SPIQA_testB.json', 'r') as f:
    testB_metadata = json.load(f)
paper_id = '1707.07012'
print(testB_metadata[paper_id]['question'])  # Questions
print(testB_metadata[paper_id]['composition'])  # Answers
Accessing the test-C metadata:

import json

with open('test-C/SPIQA_testC.json', 'r') as f:
    testC_metadata = json.load(f)
paper_id = '1808.08780'
print(testC_metadata[paper_id]['question'])  # Questions
print(testC_metadata[paper_id]['answer'])  # Answers
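As the examples above show, the three test splits expose QA pairs under different keys: a 'qa' list for test-A, 'question'/'composition' fields for test-B, and 'question'/'answer' for test-C. For evaluation loops it can help to normalize them. A minimal sketch, assuming the key layouts shown above; the 'question' and 'answer' field names inside each test-A 'qa' entry, and the parallel-list structure of the test-B and test-C fields, are our assumptions:

import json

def load_qa_pairs(path, split):
    """Yield (paper_id, question, answer) tuples for a SPIQA test split."""
    with open(path, 'r') as f:
        metadata = json.load(f)
    for paper_id, paper in metadata.items():
        if split == 'test-A':
            # Assumption: each 'qa' entry is a dict with 'question' and 'answer'
            for entry in paper['qa']:
                yield paper_id, entry['question'], entry['answer']
        elif split == 'test-B':
            # Assumption: 'question' and 'composition' are parallel lists
            for q, a in zip(paper['question'], paper['composition']):
                yield paper_id, q, a
        else:  # test-C
            for q, a in zip(paper['question'], paper['answer']):
                yield paper_id, q, a

for paper_id, question, answer in load_qa_pairs('test-C/SPIQA_testC.json', 'test-C'):
    print(paper_id, question, answer)
    break  # Print just the first pair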
Questions and answers for the SPIQA train, validation, and test-A sets were machine-generated. Additionally, the SPIQA test-A set was manually filtered and curated. Questions in the SPIQA test-B set are collected from the QASA dataset, while those in the SPIQA test-C set are from the QASPER dataset. Answering the questions in all splits requires a holistic understanding of figures and tables together with the related text from the scientific papers.
We are not aware of any personal or sensitive information in the dataset.
License: CC BY 4.0
Citation:
@article{pramanick2024spiqa,
title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
journal={NeurIPS},
year={2024}
}