LLM Health Benchmarks


Medical Texts · Apache 2.0 License · Introduced 2025-02-14

LLM Health Benchmarks Dataset

The Health Benchmarks Dataset is a specialized resource for evaluating large language models (LLMs) across medical specialties. It provides structured question-answer pairs designed to test how well AI models understand and generate domain-specific medical knowledge.

Primary Purpose

This dataset is built to:

  • Benchmark LLMs in medical specialties and subfields.
  • Assess the accuracy and contextual understanding of AI in healthcare.
  • Serve as a standardized evaluation suite for AI systems designed for medical applications.

Key Features

  • Covers 50+ medical and health-related topics, including both clinical and non-clinical domains.
  • Includes ~7,500 structured question-answer pairs.
  • Designed for fine-grained performance evaluation in medical specialties.

Applications

  • LLM Evaluation: Benchmarking AI models for domain-specific performance.
  • Healthcare AI Research: Standardized testing for AI in healthcare.
  • Medical Education AI: Testing AI systems designed for tutoring medical students.

Dataset Structure

The dataset is organized by medical specialties and subfields, each represented as a split. Below is a snapshot:

| Specialty         | Number of Rows |
|-------------------|----------------|
| Lab Medicine      | 158            |
| Ethics            | 174            |
| Dermatology       | 170            |
| Gastroenterology  | 163            |
| Internal Medicine | 178            |
| Oncology          | 180            |
| Orthopedics       | 177            |
| General Surgery   | 178            |
| Pediatrics        | 180            |
| ...(and more)     | ...            |

Each split contains:

  • Questions: The medical questions for the specialty.
  • Answers: Corresponding high-quality answers.

Usage Instructions

Here’s how you can load and use the dataset:

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("yesilhealth/Health_Benchmarks")

# Access specific specialty splits
oncology = dataset["Oncology"]
internal_medicine = dataset["Internal_Medicine"]

# View sample data
print(oncology[:5])
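To run a benchmark, each question in a split is typically turned into a prompt while its reference answer is held back for scoring. A minimal sketch of that pairing step, using an in-memory stand-in for a real split (the column names `Questions` and `Answers` are assumptions based on the split description above):

```python
# In-memory stand-in for one specialty split; a real split would come
# from load_dataset("yesilhealth/Health_Benchmarks"). The column names
# "Questions" and "Answers" are assumed, not confirmed by the card.
oncology = {
    "Questions": ["What does HER2-positive mean in breast cancer?"],
    "Answers": ["The tumor cells overexpress the HER2 protein."],
}

# Pair each question with its reference answer.
qa_pairs = list(zip(oncology["Questions"], oncology["Answers"]))

for question, reference in qa_pairs:
    prompt = f"Answer the following medical question:\n{question}"
    # Send `prompt` to the model under evaluation and keep `reference`
    # aside for the scoring step described below.
```

The same loop works for any split, since every split shares the two-column question/answer layout.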

Evaluation Workflow

  1. Model Input: Provide the questions from each split to the LLM.
  2. Model Output: Collect the AI-generated answers.
  3. Scoring: Compare model answers to ground truth answers using metrics such as:
    • Exact Match (EM)
    • F1 Score
    • Semantic Similarity
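The first two metrics above can be implemented in a few lines of plain Python. A minimal sketch of SQuAD-style Exact Match and token-level F1 (the normalization choices here, lowercasing and punctuation stripping, are one common convention, not prescribed by the dataset):

```python
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivial differences don't count."""
    return "".join(ch for ch in text.lower() if ch not in string.punctuation)

def exact_match(prediction: str, reference: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `exact_match("Benign.", "benign")` scores 1 after normalization, while `token_f1("the tumor is benign", "tumor is benign")` gives partial credit for the three shared tokens. Semantic similarity would require an embedding model and is left out of this sketch.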

Citation

If you use this dataset for research or development, please cite:

@dataset{yesilhealth_health_benchmarks,
  title={Health Benchmarks Dataset},
  author={Yesil Health AI},
  year={2024},
  url={https://huggingface.co/datasets/yesilhealth/Health_Benchmarks}
}

License

This dataset is licensed under the Apache 2.0 License.

Feedback

For questions, suggestions, or feedback, feel free to contact us via email at hello@yesilhealth.com.