HaluEval

Introduced 2023-05-19

HaluEval is a large-scale hallucination evaluation benchmark for Large Language Models (LLMs). It provides a comprehensive collection of generated and human-annotated hallucinated samples for evaluating how well LLMs can recognize hallucinations¹².

Here are the key details about the HaluEval dataset:

  1. Purpose and Overview:

    • Purpose: HaluEval aims to characterize what types of content LLMs tend to hallucinate, and to what extent.
    • Content: It includes both general user queries with ChatGPT responses and task-specific examples from three tasks: question answering, knowledge-grounded dialogue, and text summarization.
    • Data Sources:
      • For general user queries, HaluEval adopts the 52K instruction tuning dataset from Alpaca.
      • Task-specific examples are generated based on existing task datasets (e.g., HotpotQA, OpenDialKG, CNN/Daily Mail) as seed data.
  2. Data Composition:

    • General User Queries:
      • 5,000 user queries paired with ChatGPT responses.
      • Queries are selected whose sampled ChatGPT responses show low mutual similarity, since divergent responses to the same query are more likely to contain hallucinations.
    • Task-Specific Examples:
      • 30,000 examples from three tasks:
        • Question Answering: Based on HotpotQA as seed data.
        • Knowledge-Grounded Dialogue: Based on OpenDialKG as seed data.
        • Text Summarization: Based on CNN/Daily Mail as seed data.
  3. Data Release:

    • The dataset contains 35,000 samples in total: 30,000 generated task-specific samples and 5,000 human-annotated ChatGPT responses.
    • JSON files include:
      • qa_data.json: Hallucinated QA samples.
      • dialogue_data.json: Hallucinated dialogue samples.
      • summarization_data.json: Hallucinated summarization samples.
      • general_data.json: Human-annotated ChatGPT responses to general user queries.
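The released JSON files can be consumed directly with the standard library. Below is a minimal sketch of turning one QA record into the benchmark's yes/no recognition format, where a model must judge whether a given answer is hallucinated. The field names (`knowledge`, `question`, `right_answer`, `hallucinated_answer`) are assumptions based on the repository's description of `qa_data.json` and should be checked against the actual files; the sample record itself is invented for illustration.

```python
import json

# Hypothetical record mirroring the assumed qa_data.json schema
# (field names are assumptions, not verified against the release).
raw_record = json.dumps({
    "knowledge": "The Eiffel Tower is located in Paris, France.",
    "question": "Where is the Eiffel Tower located?",
    "right_answer": "Paris, France",
    "hallucinated_answer": "Lyon, France",
})

def build_recognition_examples(record: dict) -> list[dict]:
    """Expand one QA record into two recognition examples:
    one pairing the question with the correct answer (not hallucinated)
    and one pairing it with the hallucinated answer (hallucinated)."""
    base = {"knowledge": record["knowledge"], "question": record["question"]}
    return [
        {**base, "answer": record["right_answer"], "is_hallucinated": False},
        {**base, "answer": record["hallucinated_answer"], "is_hallucinated": True},
    ]

record = json.loads(raw_record)
examples = build_recognition_examples(record)
```

A model's recognition accuracy on HaluEval is then simply the fraction of such examples for which its yes/no judgment matches the `is_hallucinated` label.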

References (retrieved 3/17/2024):

  1. HaluEval: A Hallucination Evaluation Benchmark for LLMs. https://github.com/RUCAIBox/HaluEval
  2. jzjiao/halueval-sft · Datasets at Hugging Face. https://huggingface.co/datasets/jzjiao/halueval-sft
  3. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. https://aclanthology.org/2023.emnlp-main.397/
  4. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. https://arxiv.org/abs/2305.11747