HaluEval
HaluEval is a large-scale hallucination evaluation benchmark designed for Large Language Models (LLMs). It provides a comprehensive collection of generated and human-annotated hallucinated samples to evaluate the performance of LLMs in recognizing hallucinations¹².
Here are the key details about the HaluEval dataset:
- Purpose and Overview:
- Purpose: HaluEval aims to understand what types of content, and to what extent, LLMs are prone to hallucinate.
- Content: It includes both general user queries with ChatGPT responses and task-specific examples from three tasks: question answering, knowledge-grounded dialogue, and text summarization.
- Data Sources:
- For general user queries, HaluEval adopts the 52K instruction tuning dataset from Alpaca.
- Task-specific examples are generated based on existing task datasets (e.g., HotpotQA, OpenDialKG, CNN/Daily Mail) as seed data.
- Data Composition:
- General User Queries:
- 5,000 user queries paired with ChatGPT responses.
- Queries whose sampled ChatGPT responses show low mutual similarity are selected, since divergent responses are more likely to contain hallucinations.
- Task-Specific Examples:
- 30,000 examples from three tasks:
- Question Answering: Based on HotpotQA as seed data.
- Knowledge-Grounded Dialogue: Based on OpenDialKG as seed data.
- Text Summarization: Based on CNN/Daily Mail as seed data.
- Data Release:
- The dataset contains 35,000 generated and human-annotated hallucinated samples used in experiments.
- JSON files include:
  - qa_data.json: hallucinated QA samples.
  - dialogue_data.json: hallucinated dialogue samples.
  - summarization_data.json: hallucinated summarization samples.
  - general_data.json: human-annotated ChatGPT responses to general user queries.
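To illustrate how such released files can be consumed, here is a minimal Python sketch. It assumes the samples are stored one JSON object per line, with field names like `question`, `right_answer`, and `hallucinated_answer`; these names are illustrative assumptions based on the dataset description, not a verified schema from the repository.

```python
import json

# Illustrative record mimicking a qa_data.json entry; the field names
# ("knowledge", "question", "right_answer", "hallucinated_answer") are
# an assumption, not a guaranteed schema.
sample_line = json.dumps({
    "knowledge": "Arthur's Magazine (1844-1846) was an American literary periodical.",
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "right_answer": "Arthur's Magazine",
    "hallucinated_answer": "First for Women was started first.",
})

def load_samples(lines):
    """Parse JSON-Lines records (one sample per non-empty line) into dicts."""
    return [json.loads(line) for line in lines if line.strip()]

samples = load_samples([sample_line])
record = samples[0]

# Pair each answer with a hallucination label, the form a recognition
# experiment would evaluate: can the model spot the hallucinated answer?
eval_pairs = [
    (record["question"], record["right_answer"], False),
    (record["question"], record["hallucinated_answer"], True),
]
print(len(eval_pairs))  # 2
```

In an actual run, `load_samples` would read the downloaded file (e.g. `open("qa_data.json")`) instead of the in-memory sample shown here.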
Source: Conversation with Bing, 3/17/2024
(1) HaluEval: A Hallucination Evaluation Benchmark for LLMs. https://github.com/RUCAIBox/HaluEval
(2) jzjiao/halueval-sft · Datasets at Hugging Face. https://huggingface.co/datasets/jzjiao/halueval-sft
(3) HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. https://aclanthology.org/2023.emnlp-main.397/
(4) HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. https://arxiv.org/abs/2305.11747