CLEAR-Bias

Corpus for Linguistic Evaluation of Adversarial Robustness against Bias

Modalities: Texts
Introduced: 2025-04-10

CLEAR-Bias is a benchmark dataset designed to evaluate the robustness of large language models (LLMs) against bias elicitation, particularly under adversarial conditions. It comprises 4,400 prompts across two task formats: multiple-choice and sentence completion. These prompts span seven core bias categories (age, disability, ethnicity, gender, religion, sexual orientation, and socioeconomic status) as well as three intersectional categories, enabling the exploration of overlapping social biases that standard evaluations often overlook. Each of the ten categories includes 20 carefully crafted base prompts (10 per task type), which are further expanded using seven jailbreak techniques: machine translation, obfuscation, prefix injection, prompt injection, refusal suppression, reward incentives, and role-playing, each implemented in three variants. This expansion yields, per category, 20 base prompts plus 20 × 21 adversarial prompts, for 440 prompts per category and 4,400 prompts in total.
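As a rough illustration of how the figures above fit together, the following Python sketch reproduces the prompt count and shows a hypothetical filter over prompt records. The field names, the intersectional-category placeholders, and the record layout are assumptions made for illustration, not the dataset's documented schema.

```python
# Minimal sketch of CLEAR-Bias's prompt-count structure and a hypothetical
# record filter. Field names ("category", "attack") and the in-memory record
# layout are illustrative assumptions, not the dataset's documented schema.

CORE_CATEGORIES = [
    "age", "disability", "ethnicity", "gender",
    "religion", "sexual_orientation", "socioeconomic_status",
]
# The section above does not name the three intersectional categories,
# so placeholders are used here.
INTERSECTIONAL_CATEGORIES = ["intersectional_1", "intersectional_2", "intersectional_3"]
TASKS = ["multiple_choice", "sentence_completion"]  # 2 task formats
BASE_PROMPTS_PER_TASK = 10                          # 10 base prompts per task type
JAILBREAKS = [
    "machine_translation", "obfuscation", "prefix_injection", "prompt_injection",
    "refusal_suppression", "reward_incentive", "role_playing",
]
VARIANTS_PER_JAILBREAK = 3


def total_prompts() -> int:
    """20 base prompts per category, each also rendered in 7 x 3 adversarial forms."""
    categories = len(CORE_CATEGORIES) + len(INTERSECTIONAL_CATEGORIES)  # 10
    base_per_category = len(TASKS) * BASE_PROMPTS_PER_TASK              # 20
    per_base = 1 + len(JAILBREAKS) * VARIANTS_PER_JAILBREAK             # base + 21 adversarial
    return categories * base_per_category * per_base                    # 10 * 20 * 22 = 4,400


def adversarial_records(records, category, attack):
    """Filter hypothetical prompt records by bias category and jailbreak technique."""
    return [r for r in records if r["category"] == category and r["attack"] == attack]


if __name__ == "__main__":
    print(total_prompts())  # -> 4400
```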

CLEAR-Bias is intended for researchers, developers, and practitioners seeking to assess or enhance the ethical behavior of language models. It serves as a benchmarking tool for measuring how effectively different models resist producing biased outputs in both standard and adversarial scenarios. By supporting the evaluation of ethical reliability before real-world deployment, CLEAR-Bias contributes to the development of safer and more responsible LLMs.