GENEUTRAL

Texts | cc-by-4.0 | Introduced 2025-02-03

Dataset Card for GENEUTRAL

This dataset is a filtered version of BookCorpus that keeps only entries free of gender-specific words.

from datasets import load_dataset
geneutral = load_dataset('aieng-lab/geneutral', trust_remote_code=True, split='train')

Examples:

Index | Text
------|-----
8498 | no one sitting near me could tell that i was seething with rage .
8500 | by now everyone knew we were an item , the thirty-five year old business mogul , and the twenty -three year old pop sensation .
8501 | we 'd been able to keep our affair hidden for all of two months and that only because of my high security .
8503 | i was n't too worried about it , i just do n't like my personal life splashed across the headlines , but i guess it came with the territory .
8507 | i 'd sat there prepared to be bored out of my mind for the next two hours or so .
8508 | i 've seen and had my fair share of models over the years , and they no longer appealed .
8512 | when i finally looked up at the stage , my breath had got caught in my lungs .
8516 | i pulled my phone and cancelled my dinner date and essentially ended the six-month relationship i 'd been barely having with another woman .
8518 | when i see something that i want , i go after it .
8529 | if i had anything to say about that , it would be a permanent thing , or until i 'd had my fill at least .

Dataset Details

NOTE: This dataset is derived from BookCorpus, for which we do not have publication rights. Therefore, this repository only provides indices referring to gender-neutral entries within the BookCorpus dataset on Hugging Face. By using load_dataset('aieng-lab/geneutral', trust_remote_code=True, split='train'), both the indices and the full BookCorpus dataset are downloaded locally. The indices are then used to construct the GENEUTRAL dataset. The initial dataset generation takes a few minutes, but subsequent loads are cached for faster access.
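
The index-based construction described above can be illustrated with a small sketch. The corpus and index list here are toy stand-ins (assumptions for illustration only), and the actual loading script may work differently. With the `datasets` library, `Dataset.select(indices)` performs a comparable row selection.

```python
# Toy stand-in for BookCorpus; the real corpus is far larger.
corpus = [
    "he walked to the station .",
    "no one sitting near me could tell that i was seething with rage .",
    "she smiled at him .",
    "when i see something that i want , i go after it .",
]

# Hypothetical index list, playing the role of the indices shipped in this repository.
neutral_indices = [1, 3]


def select_by_index(rows, indices):
    """Return the rows at the given positions, preserving the order of `indices`."""
    return [rows[i] for i in indices]


subset = select_by_index(corpus, neutral_indices)
for idx, text in zip(neutral_indices, subset):
    print(idx, "|", text)
```

Because only the indices are distributed, the text itself always comes from the locally downloaded source corpus.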

Uses

This dataset is suitable for training and evaluating language models. For example, because it contains no gender-related words, it is well suited to measuring masked language modeling (MLM) performance of both gender-biased and gender-neutral models, giving an evaluation that is independent of gender bias.
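
As a rough illustration of such an MLM evaluation, the sketch below masks one token per sentence and measures top-1 accuracy. It is not the authors' evaluation protocol: the `[MASK]` string, the single-token masking strategy, and the `predict` callable (a placeholder for a real model) are all assumptions.

```python
import random

MASK = "[MASK]"  # assumed BERT-style mask token


def mask_one_token(sentence, rng):
    """Replace one randomly chosen alphabetic token with MASK; return (masked, target)."""
    tokens = sentence.split()
    candidates = [i for i, tok in enumerate(tokens) if tok.isalpha()]
    pos = rng.choice(candidates)
    target = tokens[pos]
    tokens[pos] = MASK
    return " ".join(tokens), target


def top1_accuracy(sentences, predict, seed=0):
    """Fraction of sentences for which predict(masked_sentence) equals the masked token."""
    rng = random.Random(seed)
    hits = 0
    for sentence in sentences:
        masked, target = mask_one_token(sentence, rng)
        hits += predict(masked) == target
    return hits / len(sentences)


def always_the(masked_sentence):
    """Hypothetical baseline predictor standing in for a real language model."""
    return "the"
```

In practice, `predict` would wrap a masked language model, and `sentences` would be drawn from the `text` entries of GENEUTRAL.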

Dataset Creation

We generated this dataset by filtering the BookCorpus dataset, leaving only entries matching the following criteria:

  • Each entry contains at least 50 characters
  • No name from aieng-lab/namextend is contained
  • No gender-specific pronoun is contained (he/she/him/her/his/hers/himself/herself)
  • No gender-specific noun is contained according to the 2421 plural-extended entries of this gendered-word dataset
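
The criteria above can be sketched as a single predicate. This is a minimal illustration, not the script used to build the dataset: the name and noun sets are tiny hypothetical placeholders for the full aieng-lab/namextend and gendered-word lists, and whole-word matching on lowercased text is an assumption.

```python
import re

# Pronouns listed in the filtering criteria.
PRONOUNS = {"he", "she", "him", "her", "his", "hers", "himself", "herself"}

# Hypothetical placeholders; the real lists are much larger.
EXAMPLE_NAMES = {"alice", "bob"}
EXAMPLE_GENDERED_NOUNS = {"king", "kings", "queen", "queens"}


def is_gender_neutral(text, names=EXAMPLE_NAMES, nouns=EXAMPLE_GENDERED_NOUNS, min_chars=50):
    """Return True if `text` passes the length, name, pronoun, and noun checks."""
    if len(text) < min_chars:
        return False
    # Match whole words so that "there" or "seething" do not trigger on "he".
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return tokens.isdisjoint(PRONOUNS | names | nouns)


print(is_gender_neutral("no one sitting near me could tell that i was seething with rage ."))
print(is_gender_neutral("she told him that the meeting would start in exactly ten minutes ."))
```

Matching on whole tokens rather than substrings avoids rejecting entries only because a neutral word happens to contain a pronoun.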