Dataset Card for GENEUTRAL

This dataset is a filtered version of BookCorpus that contains only entries free of gender-specific words.

from datasets import load_dataset

geneutral = load_dataset('aieng-lab/geneutral', trust_remote_code=True, split='train')

Examples:

Index | Text
------|-----
8498 | no one sitting near me could tell that i was seething with rage .
8500 | by now everyone knew we were an item , the thirty-five year old business mogul , and the twenty -three year old pop sensation .
8501 | we 'd been able to keep our affair hidden for all of two months and that only because of my high security .
8503 | i was n't too worried about it , i just do n't like my personal life splashed across the headlines , but i guess it came with the territory .
8507 | i 'd sat there prepared to be bored out of my mind for the next two hours or so .
8508 | i 've seen and had my fair share of models over the years , and they no longer appealed .
8512 | when i finally looked up at the stage , my breath had got caught in my lungs .
8516 | i pulled my phone and cancelled my dinner date and essentially ended the six-month relationship i 'd been barely having with another woman .
8518 | when i see something that i want , i go after it .
8529 | if i had anything to say about that , it would be a permanent thing , or until i 'd had my fill at least .

Dataset Details

Repository: github.com/aieng-lab/gradiend
Original Data: BookCorpus

NOTE: This dataset is derived from BookCorpus, for which we do not have publication rights. Therefore, this repository only provides indices referring to gender-neutral entries within the BookCorpus dataset on Hugging Face. By using load_dataset('aieng-lab/geneutral', trust_remote_code=True, split='train'), both the indices and the full BookCorpus dataset are downloaded locally. The indices are then used to construct the GENEUTRAL dataset. The initial dataset generation takes a few minutes, but subsequent loads are cached for faster access.
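The index-based construction described in the note can be pictured with a minimal sketch. It is not the repository's loading script: `build_subset` and the toy `corpus` list are hypothetical stand-ins, and plain Python lists are used in place of the downloaded BookCorpus data.

```python
def build_subset(corpus, indices):
    """Pick the entries at `indices` from `corpus`, keeping the original index.

    `corpus` stands in for the full BookCorpus text column and `indices`
    for the stored list of gender-neutral entry positions (both assumed).
    """
    return [{"index": i, "text": corpus[i]} for i in indices]


# Toy stand-in for the corpus; only entries 1 and 3 are kept.
corpus = ["entry zero", "entry one", "entry two", "entry three"]
subset = build_subset(corpus, [1, 3])
```

With a Hugging Face `Dataset` object, `Dataset.select(indices)` plays a comparable role to the list comprehension here.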

Uses

This dataset is suitable for training and evaluating language models. For example, its lack of gender-related words makes it ideal for assessing language modeling capabilities in both gender-biased and gender-neutral models during masked language modeling (MLM) tasks, allowing for an evaluation independent of gender bias.
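As a rough illustration of such an MLM evaluation, the sketch below masks whitespace-separated tokens at random and scores a predictor on the masked positions. Everything in it is an assumption made for illustration: the helper names, the `[MASK]` token, the 15% masking rate, and the `predict` callable, which is taken to map a masked string to a `{position: token}` dictionary. A real evaluation would use the model's own tokenizer.

```python
import random


def mask_tokens(text, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace each whitespace token with `mask_token` with probability `mask_prob`.

    Returns the masked string and a dict mapping masked positions to the
    original tokens.
    """
    rng = random.Random(seed)
    masked, labels = [], {}
    for pos, token in enumerate(text.split()):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels[pos] = token
        else:
            masked.append(token)
    return " ".join(masked), labels


def mlm_accuracy(predict, texts, mask_prob=0.15, seed=0):
    """Fraction of masked tokens that `predict` (hypothetical) recovers exactly."""
    correct = total = 0
    for text in texts:
        masked, labels = mask_tokens(text, mask_prob=mask_prob, seed=seed)
        predictions = predict(masked)
        for pos, gold in labels.items():
            total += 1
            correct += predictions.get(pos) == gold
    # Guard against the case where no token was masked at all.
    return correct / total if total else 0.0
```

Because the texts contain no gendered words, a score computed this way does not depend on how a model fills gendered slots.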

Dataset Creation

We generated this dataset by filtering the BookCorpus dataset, leaving only entries matching the following criteria:

Each entry contains at least 50 characters
No name from aieng-lab/namextend is contained
No gender-specific pronoun is contained (he/she/him/her/his/hers/himself/herself)
No gender-specific noun is contained according to the 2421 plural-extended entries of this gendered-word dataset
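The four criteria above can be sketched as a single predicate. This is a minimal sketch, not the filtering code actually used: `is_gender_neutral`, the lowercase word-level tokenization, and the `names` and `gendered_nouns` sets passed in are assumptions, with the two sets standing in for the name list and the plural-extended gendered-word list.

```python
import re

# Pronouns listed in the criteria above.
GENDERED_PRONOUNS = {"he", "she", "him", "her", "his", "hers", "himself", "herself"}


def is_gender_neutral(text, names, gendered_nouns, min_chars=50):
    """Return True if `text` satisfies all four filtering criteria.

    `names` and `gendered_nouns` are assumed to be sets of lowercase words.
    """
    if len(text) < min_chars:
        return False
    # Compare whole words so that "the" does not match "he".
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    blocked = GENDERED_PRONOUNS | names | gendered_nouns
    return tokens.isdisjoint(blocked)
```

Matching on whole tokens rather than substrings is a design choice of this sketch; substring matching would reject many neutral words such as "the" or "there".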