LLM Health Benchmarks


Medical Texts · Apache 2.0 License · Introduced 2025-02-14

LLM Health Benchmarks Dataset

The Health Benchmarks Dataset is a specialized resource for evaluating large language models (LLMs) across medical specialties. It provides structured question-answer pairs designed to test how well AI models understand and generate domain-specific medical knowledge.

Primary Purpose

This dataset is built to:

  • Benchmark LLMs in medical specialties and subfields.
  • Assess the accuracy and contextual understanding of AI in healthcare.
  • Serve as a standardized evaluation suite for AI systems designed for medical applications.

Key Features

  • Covers 50+ medical and health-related topics, including both clinical and non-clinical domains.
  • Includes ~7,500 structured question-answer pairs.
  • Designed for fine-grained performance evaluation in medical specialties.

Applications

  • LLM Evaluation: Benchmarking AI models for domain-specific performance.
  • Healthcare AI Research: Standardized testing for AI in healthcare.
  • Medical Education AI: Testing AI systems designed for tutoring medical students.

Dataset Structure

The dataset is organized by medical specialties and subfields, each represented as a split. Below is a snapshot:

| Specialty         | Number of Rows |
|-------------------|----------------|
| Lab Medicine      | 158            |
| Ethics            | 174            |
| Dermatology       | 170            |
| Gastroenterology  | 163            |
| Internal Medicine | 178            |
| Oncology          | 180            |
| Orthopedics       | 177            |
| General Surgery   | 178            |
| Pediatrics        | 180            |
| ...(and more)     | ...            |

Each split contains:

  • Questions: The medical questions for the specialty.
  • Answers: Corresponding high-quality answers.

Usage Instructions

Here’s how you can load and use the dataset:

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("yesilhealth/Health_Benchmarks")

# Access specific specialty splits
oncology = dataset["Oncology"]
internal_medicine = dataset["Internal_Medicine"]

# View sample data
print(oncology[:5])
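To run a benchmark, each question in a split is typically turned into a prompt while its reference answer is held back for scoring. A minimal sketch of that pairing step, using an in-memory stand-in for a real split (the column names `Questions` and `Answers` are assumptions based on the split description above):

```python
# In-memory stand-in for one specialty split; a real split would come
# from load_dataset("yesilhealth/Health_Benchmarks"). The column names
# "Questions" and "Answers" are assumed, not confirmed by the card.
oncology = {
    "Questions": ["What does HER2-positive mean in breast cancer?"],
    "Answers": ["The tumor cells overexpress the HER2 protein."],
}

# Pair each question with its reference answer.
qa_pairs = list(zip(oncology["Questions"], oncology["Answers"]))

for question, reference in qa_pairs:
    prompt = f"Answer the following medical question:\n{question}"
    # Send `prompt` to the model under evaluation and keep `reference`
    # aside for the scoring step described below.
```

The same loop works for any split, since every split shares the two-column question/answer layout.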

Evaluation Workflow

  1. Model Input: Provide the questions from each split to the LLM.
  2. Model Output: Collect the AI-generated answers.
  3. Scoring: Compare model answers to ground truth answers using metrics such as:
    • Exact Match (EM)
    • F1 Score
    • Semantic Similarity
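The first two metrics above can be implemented in a few lines of plain Python. A minimal sketch of SQuAD-style Exact Match and token-level F1 (the normalization choices here, lowercasing and punctuation stripping, are one common convention, not prescribed by the dataset):

```python
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivial differences don't count."""
    return "".join(ch for ch in text.lower() if ch not in string.punctuation)

def exact_match(prediction: str, reference: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `exact_match("Benign.", "benign")` scores 1 after normalization, while `token_f1("the tumor is benign", "tumor is benign")` gives partial credit for the three shared tokens. Semantic similarity would require an embedding model and is left out of this sketch.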

Citation

If you use this dataset for research or development, please cite:

@dataset{yesilhealth_health_benchmarks,
  title={Health Benchmarks Dataset},
  author={Yesil Health AI},
  year={2024},
  url={https://huggingface.co/datasets/yesilhealth/Health_Benchmarks}
}

License

This dataset is licensed under the Apache 2.0 License.

Feedback

For questions, suggestions, or feedback, feel free to contact us via email at hello@yesilhealth.com.