LLM Health Benchmarks Dataset - Yesil Science
The Health Benchmarks Dataset is a specialized resource for evaluating large language models (LLMs) across medical specialties. It provides structured question-answer pairs designed to test how well AI models understand and generate domain-specific medical knowledge.
Primary Purpose
This dataset is built to:
- Benchmark LLMs in medical specialties and subfields.
- Assess the accuracy and contextual understanding of AI in healthcare.
- Serve as a standardized evaluation suite for AI systems designed for medical applications.
Key Features
- Covers 50+ medical and health-related topics, including both clinical and non-clinical domains.
- Includes ~7,500 structured question-answer pairs.
- Designed for fine-grained performance evaluation in medical specialties.
Applications
- LLM Evaluation: Benchmarking AI models for domain-specific performance.
- Healthcare AI Research: Standardized testing for AI in healthcare.
- Medical Education AI: Testing AI systems designed for tutoring medical students.
Dataset Structure
The dataset is organized by medical specialties and subfields, each represented as a split. Below is a snapshot:
| Specialty | Number of Rows |
|-------------------|----------------|
| Lab Medicine | 158 |
| Ethics | 174 |
| Dermatology | 170 |
| Gastroenterology | 163 |
| Internal Medicine | 178 |
| Oncology | 180 |
| Orthopedics | 177 |
| General Surgery | 178 |
| Pediatrics | 180 |
| ...(and more) | ... |
Each split contains:
- Questions: the medical questions for the specialty.
- Answers: corresponding high-quality answers.
Usage Instructions
Here’s how you can load and use the dataset:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("yesilhealth/Health_Benchmarks")

# Access specific specialty splits
oncology = dataset["Oncology"]
internal_medicine = dataset["Internal_Medicine"]

# View sample data
print(oncology[:5])
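The evaluation loop itself can be sketched as follows. This is a minimal, self-contained illustration that uses an in-memory stand-in for a split (so it runs without downloading the dataset) and a hypothetical `generate_answer` placeholder in place of a real model call; the `Questions`/`Answers` column names follow the dataset structure described above.

```python
# Minimal sketch of the evaluation loop.
# `splits` stands in for the loaded dataset; replace with
# load_dataset("yesilhealth/Health_Benchmarks") in practice.
splits = {
    "Oncology": [
        {"Questions": "What does 'metastasis' refer to?",
         "Answers": "The spread of cancer to distant organs."},
    ],
}

def generate_answer(question: str) -> str:
    # Hypothetical placeholder: swap in a real LLM call here.
    return "The spread of cancer to distant organs."

# Collect (prediction, ground truth) pairs per split.
predictions = {}
for name, rows in splits.items():
    predictions[name] = [
        (generate_answer(row["Questions"]), row["Answers"]) for row in rows
    ]
```

Each split's `(prediction, ground truth)` pairs can then be passed to whatever scoring metric you choose.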
Evaluation Workflow
- Model Input: Provide the questions from each split to the LLM.
- Model Output: Collect the AI-generated answers.
- Scoring: Compare model answers to ground truth answers using metrics such as:
- Exact Match (EM)
- F1 Score
- Semantic Similarity
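A minimal sketch of the first two metrics, assuming simple whitespace tokenization and lowercase normalization (SQuAD-style token-level F1); semantic similarity would additionally require an embedding model and is omitted here.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> int:
    # 1 if the normalized strings are identical, else 0.
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    # Token-level F1 over whitespace tokens, counting overlap
    # with multiplicity (as in SQuAD-style evaluation).
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Stage II", "stage ii"))                     # 1
print(round(token_f1("spread of cancer", "cancer spread"), 2)) # 0.8
```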
Citation
If you use this dataset for research or development, please cite:
@dataset{yesilhealth_health_benchmarks,
title={Health Benchmarks Dataset},
author={Yesil Health AI},
year={2024},
url={https://huggingface.co/datasets/yesilhealth/Health_Benchmarks}
}
License
This dataset is licensed under the Apache 2.0 License.
Feedback
For questions, suggestions, or feedback, feel free to contact us via email at hello@yesilhealth.com.