PQAref

PubMed Question Answering with references

Texts | AGPLv3 | Introduced 2024-07-06

PQAref is a dataset for fine-tuning large language models for referenced question answering in the biomedical domain.

Each sample consists of three components:

Instruction - the question to be answered.

Abstracts - a set of 10 relevant abstracts retrieved from PubMed by an IR system. Each contains the PubMed ID, the abstract title and the abstract content.

Answer - the expected answer, with references in the form of PubMed IDs.
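The three components can be pictured as a small record type. The sketch below is only an illustration: the class names, field names and the helper `cited_ids` are hypothetical, the on-disk format of the dataset is not specified here, and the example values are placeholders rather than real dataset content. The helper assumes only that referenced PubMed IDs appear literally as digit strings in the answer text.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Abstract:
    # Field names are assumptions, chosen to mirror the description above.
    pubmed_id: str
    title: str
    content: str


@dataclass
class PQArefSample:
    instruction: str            # the question to be answered
    abstracts: List[Abstract]   # 10 per sample in the dataset; fewer here for brevity
    answer: str                 # expected answer citing PubMed IDs


def cited_ids(sample: PQArefSample) -> List[str]:
    """Return IDs of the retrieved abstracts that occur in the answer text."""
    found = []
    for abstract in sample.abstracts:
        # Match the ID only as a standalone run of digits,
        # so "1111" does not match inside "11111111".
        pattern = rf"(?<!\d){re.escape(abstract.pubmed_id)}(?!\d)"
        if re.search(pattern, sample.answer):
            found.append(abstract.pubmed_id)
    return found


# Placeholder content, not taken from the dataset.
example = PQArefSample(
    instruction="Does treatment X improve outcome Y?",
    abstracts=[
        Abstract("11111111", "Placeholder title A", "Placeholder abstract text A."),
        Abstract("22222222", "Placeholder title B", "Placeholder abstract text B."),
    ],
    answer="Treatment X was associated with improved outcome Y (PMID: 11111111).",
)
print(cited_ids(example))
```

A check like this can be useful when evaluating a fine-tuned model, since it shows whether the references in a generated answer point at abstracts that were actually provided.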

The dataset was created semi-automatically, using questions from the PubMedQA dataset.

The dataset contains 9,075 samples, split into training, validation and test sets in an 80%:10%:10% proportion.
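As a rough guide to what the proportions mean in sample counts, the arithmetic can be sketched as follows. The exact size of each split is not stated here, and since 10% of 9,075 is not a whole number, the rounding used below (floor for training and validation, remainder to test) is an assumption; the function name `split_sizes` is hypothetical.

```python
def split_sizes(total: int, train_pct: int = 80, val_pct: int = 10):
    """Approximate split sizes using integer arithmetic.

    Training and validation are rounded down; the test split takes
    whatever remains. This rounding rule is an assumption and may not
    match the published splits exactly.
    """
    train = total * train_pct // 100
    val = total * val_pct // 100
    test = total - train - val
    return train, val, test


print(split_sizes(9075))  # (7260, 907, 908) under the assumed rounding
```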