
Modality: Text · License: CC BY 4.0 · Introduced: 2025-05-23

PubMedQA-MetaGen: Metadata-Enriched PubMedQA Corpus

Dataset Summary

PubMedQA-MetaGen is a metadata-enriched version of the PubMedQA biomedical question-answering dataset, created using the MetaGenBlendedRAG enrichment pipeline. The dataset contains both the original and enriched versions of the corpus, enabling direct benchmarking of retrieval-augmented and semantic-search approaches in biomedical NLP.

Files Provided

PubMedQA_original_corpus.json
This file contains the original PubMedQA corpus, formatted directly from the official PubMedQA dataset. Each record includes the biomedical question, the context (abstract), and the answer field, mirroring the original dataset structure.
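The original corpus can be loaded with Python's standard json module. The record below is an illustrative sketch of the schema described above; the exact field names are assumptions and should be verified against the actual file.

```python
import json

# Illustrative record mirroring the assumed schema of
# PubMedQA_original_corpus.json; field names ("question", "context",
# "answer") are assumptions, so verify them against the actual file.
raw = """
[
  {
    "question": "Do mitochondria play a role in apoptosis?",
    "context": "Abstract text describing mitochondrial involvement in programmed cell death.",
    "answer": "yes"
  }
]
"""

# With the real file you would instead use:
#   with open("PubMedQA_original_corpus.json") as f:
#       corpus = json.load(f)
corpus = json.loads(raw)

for record in corpus:
    print(record["question"], "->", record["answer"])
```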

PubMedQA_corpus_with_metadata.json
This file contains the metadata-enriched version, created by processing the original corpus through the MetaGenBlendedRAG pipeline. In addition to the original fields, each entry is augmented with structured metadata, including key concepts, MeSH terms, automatically generated keywords, extracted entities, and LLM-generated summaries, designed to support advanced retrieval and RAG research.
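A hypothetical enriched entry might look like the following; the metadata field names here are assumptions based on the description above, not a guaranteed schema. A common pattern is to flatten the metadata into a single searchable field that is indexed alongside the raw abstract.

```python
# Hypothetical enriched entry; the metadata field names below are
# assumptions for illustration, not the dataset's guaranteed schema.
enriched = {
    "question": "Do mitochondria play a role in apoptosis?",
    "context": "Abstract text describing mitochondrial involvement in programmed cell death.",
    "answer": "yes",
    "key_concepts": ["apoptosis", "mitochondria"],
    "mesh_terms": ["Apoptosis", "Mitochondria"],
    "keywords": ["programmed cell death", "cytochrome c"],
    "entities": ["cytochrome c", "caspase-3"],
    "summary": "The abstract discusses mitochondrial control of apoptosis.",
}

# Flatten the enrichment into one string that a search index can
# store as an auxiliary field next to the abstract text.
metadata_text = " ".join(
    enriched["key_concepts"] + enriched["mesh_terms"] + enriched["keywords"]
)
print(metadata_text)
```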

How to Use

RAG evaluation: Benchmark retrieval-augmented QA models using the enriched context for higher recall and precision.
Semantic search: Build improved biomedical search engines that leverage topic, entity, and keyword metadata.
NLP & LLM fine-tuning: Use for fine-tuning models that benefit from structured biomedical context.

Dataset Structure

Each sample contains:

Original fields: Question, context (abstract), answer

Enriched fields (in PubMedQA_corpus_with_metadata.json only):

Key concepts and topics
Extracted MeSH terms and UMLS entities
Automatically generated keywords
Section/type labels
LLM-generated summaries/metadata
Document identifiers and links

Dataset Creation Process

Source: Original PubMedQA dataset.
Metadata enrichment: Applied the MetaGenBlendedRAG pipeline (rule-based, NLP, and LLM-driven enrichment).
Outputs: Two files (original and enriched), supporting both traditional and metadata-driven research.

Intended Use and Limitations

For research and educational use in biomedical QA, RAG, semantic retrieval, and metadata-enrichment evaluation. Note: some metadata fields generated by LLMs may vary in quality; users should verify outputs for critical applications.
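The metadata-driven retrieval idea above can be sketched with a toy scoring function: documents are ranked by term overlap between the query and both the raw abstract and the enriched keyword/MeSH fields, with metadata matches weighted higher. The field names, weights, and whitespace tokenization are all simplifying assumptions for illustration, not the MetaGenBlendedRAG method itself.

```python
# Toy metadata-aware retrieval: rank documents by term overlap with
# the query, counting matches in enriched metadata twice as heavily
# as matches in the raw abstract. Tokenization is naive whitespace
# splitting; field names and weights are illustrative assumptions.

def score(query_terms, doc):
    text_terms = set(doc["context"].lower().split())
    meta_terms = {t.lower() for t in doc.get("keywords", []) + doc.get("mesh_terms", [])}
    text_hits = sum(1 for t in query_terms if t in text_terms)
    meta_hits = sum(1 for t in query_terms if t in meta_terms)
    return text_hits + 2 * meta_hits

docs = [
    {"context": "Abstract about cardiac arrhythmia treatment.",
     "keywords": ["arrhythmia"], "mesh_terms": ["Heart Diseases"]},
    {"context": "Abstract about mitochondrial apoptosis.",
     "keywords": ["apoptosis"], "mesh_terms": ["Mitochondria"]},
]

query = {"mitochondrial", "apoptosis"}
best = max(docs, key=lambda d: score(query, d))
print(best["context"])
```

A real evaluation would swap this scorer for BM25 or a dense retriever, but the pattern of indexing enriched fields next to the abstract stays the same.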