Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Question Answering on TruthfulQA

Metric: MC1 (higher is better)
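MC1 is generally described as single-true multiple-choice accuracy: for each question the model scores every answer choice, and the question counts as correct only if the top-scoring choice is the one true answer. The sketch below assumes that definition and assumes per-choice scores (for example, log-probabilities) are already available. The function name `mc1_score` and its arguments are hypothetical and do not come from this page.

```python
from typing import Sequence


def mc1_score(
    choice_scores: Sequence[Sequence[float]],
    correct_index: Sequence[int],
) -> float:
    """Return the fraction of questions whose top-scoring choice is correct.

    choice_scores[i][j] is the model's score (assumed: log-probability)
    for answer choice j of question i.
    correct_index[i] is the position of the single true answer for question i.
    """
    if len(choice_scores) != len(correct_index):
        raise ValueError("need one correct index per question")
    if not choice_scores:
        raise ValueError("need at least one question")

    hits = 0
    for scores, gold in zip(choice_scores, correct_index):
        # Pick the choice with the highest score (first one wins on ties).
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += int(predicted == gold)
    return hits / len(choice_scores)


# Illustrative scores only, not taken from any model on the leaderboard.
example_scores = [
    [-1.2, -3.4, -2.0],  # top choice is index 0
    [-2.5, -0.7, -1.9],  # top choice is index 1
    [-0.9, -1.1, -4.0],  # top choice is index 0
]
example_gold = [0, 1, 2]
example_mc1 = mc1_score(example_scores, example_gold)
```

Under this reading, an MC1 of 0.59 would mean the top-ranked choice was the true answer on 59% of questions.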

Results

| # | Model | MC1 | Extra Data | Paper | Date | Code |
|---|-------|-----|------------|-------|------|------|
| 1 | GPT-4 (RLHF) | 0.59 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 2 | Mistral-7B-Instruct-v0.2 + TruthX | 0.56 | No | TruthX: Alleviating Hallucinations by Editing La... | 2024-02-27 | Code |
| 3 | LLaMa-2-7B-Chat + TruthX | 0.54 | No | TruthX: Alleviating Hallucinations by Editing La... | 2024-02-27 | Code |
| 4 | LLaMA-2-Chat-13B + Representation Control (Contrast Vector) | 0.54 | No | Representation Engineering: A Top-Down Approach ... | 2023-10-02 | Code |
| 5 | LLaMA-2-Chat-7B + Representation Control (Contrast Vector) | 0.48 | No | Representation Engineering: A Top-Down Approach ... | 2023-10-02 | Code |
| 6 | Vicuna 7B + Inference Time Intervention (ITI) | 0.389 | No | - | - | - |
| 7 | Alpaca 7B + Inference Time Intervention (ITI) | 0.319 | No | - | - | - |
| 8 | Gopher 280B (zero-shot, Our Prompt + Choices) | 0.295 | No | Scaling Language Models: Methods, Analysis & Ins... | 2021-12-08 | Code |
| 9 | LLaMA 7B + Inference Time Intervention (ITI) | 0.288 | No | - | - | - |
| 10 | GAL 120B | 0.26 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code |
| 11 | Gopher 7.1 (zero-shot, QA prompts) | 0.25 | No | Scaling Language Models: Methods, Analysis & Ins... | 2021-12-08 | Code |
| 12 | GAL 30B | 0.24 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code |
| 13 | Gopher 7.1B (zero-shot, Our Prompt + Choices) | 0.23 | No | Scaling Language Models: Methods, Analysis & Ins... | 2021-12-08 | Code |
| 14 | Gopher 1.4 (zero-shot, QA prompts) | 0.23 | No | Scaling Language Models: Methods, Analysis & Ins... | 2021-12-08 | Code |
| 15 | GPT-2 1.5B | 0.22 | No | TruthfulQA: Measuring How Models Mimic Human Fal... | 2021-09-08 | Code |
| 16 | Gopher 1.4B (zero-shot, Our Prompt + Choices) | 0.217 | No | Scaling Language Models: Methods, Analysis & Ins... | 2021-12-08 | Code |
| 17 | GPT-3 175B | 0.21 | No | TruthfulQA: Measuring How Models Mimic Human Fal... | 2021-09-08 | Code |
| 18 | OPT 175B | 0.21 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code |
| 19 | GPT-J 6B | 0.2 | No | TruthfulQA: Measuring How Models Mimic Human Fal... | 2021-09-08 | Code |
| 20 | UnifiedQA 3B | 0.19 | No | TruthfulQA: Measuring How Models Mimic Human Fal... | 2021-09-08 | Code |
| 21 | GAL 125M | 0.19 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code |
| 22 | GAL 1.3B | 0.19 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code |
| 23 | GAL 6.7B | 0.19 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code |