Dataset Name: SPIQA (Scientific Paper Image Question Answering)
Paper: SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
GitHub: SPIQA eval and metrics code repo
Dataset Summary: SPIQA is a large-scale and challenging QA dataset focused on figures, tables, and text paragraphs from scientific research papers in various computer science domains. The figures cover a wide variety of plots, charts, schematic diagrams, result visualizations, etc. The dataset is the result of a meticulous curation process that leverages the ability of multimodal large language models (MLLMs) to understand figures. We employ both automatic and manual curation to ensure a high level of quality and reliability. SPIQA consists of more than 270K questions divided into training, validation, and three different evaluation splits. The purpose of the dataset is to evaluate the ability of large multimodal models to comprehend complex figures and tables together with the textual paragraphs of scientific papers.
This Data Card describes the structure of the SPIQA dataset. The test-B and test-C splits are filtered from the QASA and QASPER datasets, respectively, and contain human-written QAs. The underlying papers were collected from arXiv and cover publications at top computer science conferences between 2018 and 2023.
If you have any comments or questions, reach out to Shraman Pramanick or Subhashini Venugopalan.
Supported Tasks: Question answering on scientific papers, grounded in figures, tables, and text paragraphs.
Language: English
Release Date: June 2024
The statistics of the different splits of SPIQA are shown below.
| Split  | Papers | Questions | Schematics | Plots & Charts | Visualizations | Other figures | Tables  |
|--------|--------|-----------|------------|----------------|----------------|---------------|---------|
| Train  | 25,459 | 262,524   | 44,008     | 70,041         | 27,297         | 6,450         | 114,728 |
| Val    | 200    | 2,085     | 360        | 582            | 173            | 55            | 915     |
| test-A | 118    | 666       | 154        | 301            | 131            | 95            | 434     |
| test-B | 65     | 228       | 147        | 156            | 133            | 17            | 341     |
| test-C | 314    | 493       | 415        | 404            | 26             | 66            | 1,332   |
The contents of the dataset are structured as follows:
SPIQA
├── SPIQA_train_val_test-A_extracted_paragraphs.zip
│   └── Extracted textual paragraphs from the papers in the SPIQA train, val, and test-A splits
├── SPIQA_train_val_test-A_raw_tex.zip
│   └── Raw TeX files from the papers in the SPIQA train, val, and test-A splits. These files are not required to reproduce our results; we open-source them for future research.
├── train_val
│   ├── SPIQA_train_val_Images.zip
│   │   └── Full-resolution figures and tables from the papers in the SPIQA train and val splits
│   ├── SPIQA_train.json
│   │   └── SPIQA train metadata
│   └── SPIQA_val.json
│       └── SPIQA val metadata
├── test-A
│   ├── SPIQA_testA_Images.zip
│   │   └── Full-resolution figures and tables from the papers in the SPIQA test-A split
│   ├── SPIQA_testA_Images_224px.zip
│   │   └── 224px figures and tables from the papers in the SPIQA test-A split
│   └── SPIQA_testA.json
│       └── SPIQA test-A metadata
├── test-B
│   ├── SPIQA_testB_Images.zip
│   │   └── Full-resolution figures and tables from the papers in the SPIQA test-B split
│   ├── SPIQA_testB_Images_224px.zip
│   │   └── 224px figures and tables from the papers in the SPIQA test-B split
│   └── SPIQA_testB.json
│       └── SPIQA test-B metadata
└── test-C
    ├── SPIQA_testC_Images.zip
    │   └── Full-resolution figures and tables from the papers in the SPIQA test-C split
    ├── SPIQA_testC_Images_224px.zip
    │   └── 224px figures and tables from the papers in the SPIQA test-C split
    └── SPIQA_testC.json
        └── SPIQA test-C metadata
The testA_data_viewer.json file is provided only for previewing a portion of the data in the Hugging Face dataset viewer, to give a quick sense of the metadata.
The metadata for every split is provided as a dictionary whose keys are the arXiv IDs of the papers. The primary contents of each dictionary item are illustrated in the access examples below.
We recommend that users download the metadata and images to their local machine. Downloading the whole dataset:

from huggingface_hub import snapshot_download

# Download the full dataset snapshot; point local_dir at your target directory
snapshot_download(repo_id="google/spiqa", repo_type="dataset", local_dir='.')
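Once the snapshot is downloaded, the image archives can be unpacked with Python's standard library. A minimal sketch, assuming the directory layout shown in the tree above:

import zipfile

# Unpack the test-A full-resolution images alongside the archive
with zipfile.ZipFile('test-A/SPIQA_testA_Images.zip') as zf:
    zf.extractall('test-A/')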
Downloading a specific file:

from huggingface_hub import hf_hub_download

# Download only the test-A metadata JSON; point local_dir at your target directory
hf_hub_download(repo_id="google/spiqa", filename="test-A/SPIQA_testA.json", repo_type="dataset", local_dir='.')
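To fetch all files belonging to a single split rather than one file at a time, snapshot_download also accepts an allow_patterns filter. A minimal sketch; the glob pattern is our assumption, based on the tree above:

from huggingface_hub import snapshot_download

# Download only the test-A metadata and image archives
snapshot_download(repo_id="google/spiqa", repo_type="dataset",
                  local_dir='.', allow_patterns=["test-A/*"])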
Accessing the test-A metadata:

import json

with open('test-A/SPIQA_testA.json', 'r') as f:
    testA_metadata = json.load(f)
paper_id = '1702.03584v3'
print(testA_metadata[paper_id]['qa'])  # All QAs for this paper
Accessing the test-B metadata:

import json

with open('test-B/SPIQA_testB.json', 'r') as f:
    testB_metadata = json.load(f)
paper_id = '1707.07012'
print(testB_metadata[paper_id]['question'])  # Questions
print(testB_metadata[paper_id]['composition'])  # Answers
Accessing the test-C metadata:

import json

with open('test-C/SPIQA_testC.json', 'r') as f:
    testC_metadata = json.load(f)
paper_id = '1808.08780'
print(testC_metadata[paper_id]['question'])  # Questions
print(testC_metadata[paper_id]['answer'])  # Answers
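As the examples above show, the three test splits expose QA pairs under different keys: a 'qa' list for test-A, 'question'/'composition' fields for test-B, and 'question'/'answer' for test-C. For evaluation loops it can help to normalize them. A minimal sketch, assuming the key layouts shown above; the 'question' and 'answer' field names inside each test-A 'qa' entry, and the parallel-list structure of the test-B and test-C fields, are our assumptions:

import json

def load_qa_pairs(path, split):
    """Yield (paper_id, question, answer) tuples for a SPIQA test split."""
    with open(path, 'r') as f:
        metadata = json.load(f)
    for paper_id, paper in metadata.items():
        if split == 'test-A':
            # Assumption: each 'qa' entry is a dict with 'question' and 'answer'
            for entry in paper['qa']:
                yield paper_id, entry['question'], entry['answer']
        elif split == 'test-B':
            # Assumption: 'question' and 'composition' are parallel lists
            for q, a in zip(paper['question'], paper['composition']):
                yield paper_id, q, a
        else:  # test-C
            for q, a in zip(paper['question'], paper['answer']):
                yield paper_id, q, a

for paper_id, question, answer in load_qa_pairs('test-C/SPIQA_testC.json', 'test-C'):
    print(paper_id, question, answer)
    break  # Print just the first pair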
Questions and answers for the SPIQA train, validation, and test-A sets were machine-generated. Additionally, the SPIQA test-A set was manually filtered and curated. Questions in the SPIQA test-B set are collected from the QASA dataset, while those in the SPIQA test-C set are from the QASPER dataset. Answering the questions in all splits requires a holistic understanding of figures and tables together with the related text from the scientific papers.
We are not aware of any personal or sensitive information in the dataset.
License: CC BY 4.0
Citation:
@article{pramanick2024spiqa,
title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
journal={NeurIPS},
year={2024}
}