TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/StereoSet: Measuring stereotypical bias in pretrained lang...

StereoSet: Measuring stereotypical bias in pretrained language models

Moin Nadeem, Anna Bethke, Siva Reddy

2020-04-20ACL 2021 5MathBias Detection
PaperPDFCodeCode(official)Code

Abstract

A stereotype is an over-generalized belief about a particular group of people, e.g., Asians are good at math or Asians are bad drivers. Such beliefs (biases) are known to hurt target groups. Since pretrained language models are trained on large real world data, they are known to capture stereotypical biases. In order to assess the adverse effects of these models, it is important to quantify the bias captured in them. Existing literature on quantifying bias evaluates pretrained language models on a small set of artificially constructed bias-assessing sentences. We present StereoSet, a large-scale natural dataset in English to measure stereotypical biases in four domains: gender, profession, race, and religion. We evaluate popular models like BERT, GPT-2, RoBERTa, and XLNet on our dataset and show that these models exhibit strong stereotypical biases. We also present a leaderboard with a hidden test set to track the bias of future language models at https://stereoset.mit.edu

Results

TaskDatasetMetricValueModel
Bias DetectionStereoSetICAT Score72.97GPT-2 (small)
Bias DetectionStereoSetICAT Score72.03XLNet (large)
Bias DetectionStereoSetICAT Score71.73GPT-2 (medium)
Bias DetectionStereoSetICAT Score71.21BERT (base)
Bias DetectionStereoSetICAT Score70.54GPT-2 (large)
Bias DetectionStereoSetICAT Score69.89BERT (large)
Bias DetectionStereoSetICAT Score67.5RoBERTa (base)
Bias DetectionStereoSetICAT Score62.1XLNet (base)

Related Papers

VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks2025-07-17QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation2025-07-17Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training2025-07-16Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding2025-07-15Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing2025-07-15Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination2025-07-14A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning2025-07-11Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs2025-07-10