TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

Shaolei Zhang, Tian Yu, Yang Feng

2024-02-27Question Answering Text Generation TruthfulQA Hallucination Hallucination Evaluation Large Language Model Contrastive Learning Language Modelling

Paper PDF Code(official)

Abstract

Large Language Models (LLMs) sometimes suffer from producing hallucinations, especially LLMs may generate untruthful responses despite knowing the correct knowledge. Activating the truthfulness within LLM is the key to fully unlocking LLM's knowledge potential. In this paper, we propose TruthX, an inference-time intervention method to activate the truthfulness of LLM by identifying and editing the features within LLM's internal representations that govern the truthfulness. TruthX employs an auto-encoder to map LLM's representations into semantic and truthful latent spaces respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, by editing LLM's internal representations in truthful space, TruthX effectively enhances the truthfulness of LLM. Experiments show that TruthX improves the truthfulness of 13 advanced LLMs by an average of 20% on TruthfulQA benchmark. Further analyses suggest that TruthX can control LLM to produce truthful or hallucinatory responses via editing only one vector in LLM's internal representations.

Results

Task	Dataset	Metric	Value	Model
Question Answering	TruthfulQA	MC1	0.56	Mistral-7B-Instruct-v0.2 + TruthX
Question Answering	TruthfulQA	MC2	0.75	Mistral-7B-Instruct-v0.2 + TruthX
Question Answering	TruthfulQA	MC1	0.54	LLaMa-2-7B-Chat + TruthX
Question Answering	TruthfulQA	MC2	0.74	LLaMa-2-7B-Chat + TruthX

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment2025-07-21 DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits2025-07-18 From Roots to Rewards: Dynamic Tree Reasoning with RL2025-07-17 Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering2025-07-17 Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It2025-07-17 City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning2025-07-17 Making Language Model a Hierarchical Classifier and Generator2025-07-17 GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM2025-07-17