PQAref
Pubmed Question Answering with references
Texts | AGPLv3 | Introduced 2024-07-06
PQAref is a dataset for fine-tuning large language models for referenced question answering in the biomedical domain.
The dataset contains 3 components:
Instruction - the question to be answered.
Abstracts - a set of 10 relevant abstracts retrieved from PubMed by an IR system. Each contains the PubMed ID, the abstract title, and the content of the abstract.
Answer - the expected answer, with references in the form of PubMed IDs.
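To make the three components concrete, here is a minimal Python sketch of how one sample could be represented and turned into a single prompt. The class names, field names (`pubmed_id`, `title`, `content`) and the prompt layout are illustrative assumptions, not the dataset's actual schema or the authors' prompt template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Abstract:
    # Field names are hypothetical; the released files may name them differently.
    pubmed_id: str
    title: str
    content: str


@dataclass
class Sample:
    instruction: str           # the question to be answered
    abstracts: List[Abstract]  # retrieved PubMed abstracts (10 per sample in PQAref)
    answer: str                # expected answer citing PubMed IDs


def build_prompt(sample: Sample) -> str:
    """Join the abstracts and the question into one prompt string (assumed layout)."""
    blocks = [
        f"PMID: {a.pubmed_id}\nTitle: {a.title}\nAbstract: {a.content}"
        for a in sample.abstracts
    ]
    context = "\n\n".join(blocks)
    return f"{context}\n\nQuestion: {sample.instruction}\nAnswer:"
```

In a fine-tuning setup, the string returned by `build_prompt` would serve as the model input and `sample.answer` as the target.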
The dataset was created semi-automatically, using questions available from the PubMedQA dataset.
The dataset contains 9,075 samples, split into training, validation, and test sets in an 80%:10%:10% ratio.
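The split arithmetic can be sketched as follows. Because 10% of 9,075 is not a whole number, some rounding rule is needed; the one below (floor for training and validation, remainder to test) and the fixed shuffle seed are assumptions for illustration, so the sizes of the released splits may differ by a sample.

```python
import random
from typing import List, Tuple


def split_sizes(n: int, train_pct: int = 80, val_pct: int = 10) -> Tuple[int, int, int]:
    """Return (train, val, test) sizes using integer arithmetic; test gets the remainder."""
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    n_test = n - n_train - n_val
    return n_train, n_val, n_test


def split_indices(n: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices with a fixed seed (assumed) and cut them into three parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train, n_val, _ = split_sizes(n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test
```

Under this rounding rule, `split_sizes(9075)` gives 7,260 training, 907 validation, and 908 test samples.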