HaluEval
HaluEval is a large-scale hallucination evaluation benchmark designed for Large Language Models (LLMs). It provides a comprehensive collection of generated and human-annotated hallucinated samples to evaluate the performance of LLMs in recognizing hallucinations¹².
Here are the key details about the HaluEval dataset:
- Purpose and Overview:
- Purpose: HaluEval aims to understand what types of content, and to what extent, LLMs are prone to hallucinate.
- Content: It includes both general user queries with ChatGPT responses and task-specific examples from three tasks: question answering, knowledge-grounded dialogue, and text summarization.
- Data Sources:
- For general user queries, HaluEval adopts the 52K instruction tuning dataset from Alpaca.
- Task-specific examples are generated based on existing task datasets (e.g., HotpotQA, OpenDialKG, CNN/Daily Mail) as seed data.
- Data Composition:
- General User Queries:
- 5,000 user queries paired with ChatGPT responses.
- Queries whose sampled ChatGPT responses show low mutual similarity are selected, since divergent responses are more likely to contain hallucinations.
- Task-Specific Examples:
- 30,000 examples from three tasks:
- Question Answering: Based on HotpotQA as seed data.
- Knowledge-Grounded Dialogue: Based on OpenDialKG as seed data.
- Text Summarization: Based on CNN/Daily Mail as seed data.
- Data Release:
- The dataset contains 35,000 generated and human-annotated hallucinated samples used in experiments.
- JSON files include:
  - qa_data.json: hallucinated QA samples.
  - dialogue_data.json: hallucinated dialogue samples.
  - summarization_data.json: hallucinated summarization samples.
  - general_data.json: human-annotated ChatGPT responses to general user queries.
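To illustrate how such released files can be consumed, here is a minimal Python sketch. It assumes the samples are stored one JSON object per line, with field names like `question`, `right_answer`, and `hallucinated_answer`; these names are illustrative assumptions based on the dataset description, not a verified schema from the repository.

```python
import json

# Illustrative record mimicking a qa_data.json entry; the field names
# ("knowledge", "question", "right_answer", "hallucinated_answer") are
# an assumption, not a guaranteed schema.
sample_line = json.dumps({
    "knowledge": "Arthur's Magazine (1844-1846) was an American literary periodical.",
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "right_answer": "Arthur's Magazine",
    "hallucinated_answer": "First for Women was started first.",
})

def load_samples(lines):
    """Parse JSON-Lines records (one sample per non-empty line) into dicts."""
    return [json.loads(line) for line in lines if line.strip()]

samples = load_samples([sample_line])
record = samples[0]

# Pair each answer with a hallucination label, the form a recognition
# experiment would evaluate: can the model spot the hallucinated answer?
eval_pairs = [
    (record["question"], record["right_answer"], False),
    (record["question"], record["hallucinated_answer"], True),
]
print(len(eval_pairs))  # 2
```

In an actual run, `load_samples` would read the downloaded file (e.g. `open("qa_data.json")`) instead of the in-memory sample shown here.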
Source: Conversation with Bing, 3/17/2024
(1) HaluEval: A Hallucination Evaluation Benchmark for LLMs. https://github.com/RUCAIBox/HaluEval
(2) jzjiao/halueval-sft · Datasets at Hugging Face. https://huggingface.co/datasets/jzjiao/halueval-sft
(3) HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. https://aclanthology.org/2023.emnlp-main.397/
(4) HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. https://arxiv.org/abs/2305.11747