Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


GENTER

GEnder Name TEmplates with pRonouns

Texts · cc-by-4.0 · Introduced 2025-02-03

This dataset consists of template sentences associating first names ([NAME]) with third-person singular pronouns ([PRONOUN]), e.g.,

[NAME] asked , not sounding as if [PRONOUN] cared about the answer .
after all , [NAME] was the same as [PRONOUN] 'd always been .
there were moments when [NAME] was soft , when [PRONOUN] seemed more like the person [PRONOUN] had been .

Usage

from datasets import load_dataset

genter = load_dataset('aieng-lab/genter', trust_remote_code=True, split=split)

Here, split can be either train, val, test, or all.

Dataset Details

Dataset Description


This dataset is a filtered version of BookCorpus that includes only sentences in which a first name is followed by its correct third-person singular pronoun (he/she).

From these sentences, masked template sentences are created with two template keys: [NAME] and [PRONOUN]. This design allows the dataset to generate diverse sentences by varying the names (e.g., using names from aieng-lab/namexact) and inserting the appropriate pronoun for each name.
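As an illustration, filling a template with a concrete name and its matching pronoun can be sketched as follows (the name-to-gender mapping below is a toy stand-in for aieng-lab/namexact, not the real data):

```python
# Toy gender lookup; in practice, names and genders come from aieng-lab/namexact.
PRONOUNS = {"F": "she", "M": "he"}

def fill_template(masked: str, name: str, gender: str) -> str:
    """Instantiate a masked GENTER template with a name and its pronoun."""
    pronoun = PRONOUNS[gender]
    # str.replace fills every occurrence, which also covers templates
    # with multiple [PRONOUN] masks.
    return masked.replace("[NAME]", name).replace("[PRONOUN]", pronoun)

template = "[NAME] asked , not sounding as if [PRONOUN] cared about the answer ."
filled = fill_template(template, "jessica", "F")
# -> 'jessica asked , not sounding as if she cared about the answer .'
```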

Dataset Sources

  • Repository: github.com/aieng-lab/gradiend
  • Original Data: BookCorpus

NOTE: This dataset is derived from BookCorpus, for which we do not have publication rights. Therefore, this repository only provides indices, names, and pronouns referring to GENTER entries within the BookCorpus dataset on Hugging Face. By using load_dataset('aieng-lab/genter', trust_remote_code=True, split='all'), both the indices and the full BookCorpus dataset are downloaded locally. The indices are then used to construct the GENTER dataset. The initial dataset generation takes a few minutes, but subsequent loads are cached for faster access.
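Conceptually, the index-based reconstruction works as sketched below. This is a toy illustration with naive string replacement; the actual loader script in aieng-lab/genter performs the reconstruction internally, with proper token-level matching:

```python
# Toy BookCorpus stand-in; the real loader downloads the full corpus.
bookcorpus = [
    "some unrelated sentence .",
    "jessica asked , not sounding as if she cared about the answer .",
    "another unrelated sentence .",
]

# Each stored GENTER entry: (index into BookCorpus, name, pronoun).
genter_meta = [(1, "jessica", "she")]

def reconstruct(corpus, meta):
    """Rebuild GENTER records (text + masked template) from stored indices."""
    records = []
    for index, name, pronoun in meta:
        text = corpus[index]
        # Naive replacement for illustration only; whole-token matching
        # is needed in practice to avoid masking substrings.
        masked = text.replace(name, "[NAME]").replace(pronoun, "[PRONOUN]")
        records.append({"text": text, "masked": masked,
                        "name": name, "pronoun": pronoun, "index": index})
    return records

records = reconstruct(bookcorpus, genter_meta)
```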

Dataset Structure

  • text: the original BookCorpus entry
  • masked: the masked version of text, i.e., with template masks for the name ([NAME]) and the pronoun ([PRONOUN])
  • label: the gender of the originally used name (F for female, M for male)
  • name: the original name in text that is masked in masked as [NAME]
  • pronoun: the original pronoun in text that is masked in masked as [PRONOUN] (he/she)
  • pronoun_count: the number of pronoun occurrences (typically 1, at most 4)
  • index: the index of text in BookCorpus
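The relationship between these fields can be checked mechanically. A minimal sketch with a toy record (only a subset of the fields is shown):

```python
# Toy record mirroring the documented field layout.
record = {
    "text": "after all , ella was the same as she 'd always been .",
    "masked": "after all , [NAME] was the same as [PRONOUN] 'd always been .",
    "name": "ella",
    "pronoun": "she",
    "pronoun_count": 1,
}

def is_consistent(r) -> bool:
    """Re-filling the template must reproduce the original text, and the
    pronoun count must match the number of [PRONOUN] masks."""
    refilled = (r["masked"]
                .replace("[NAME]", r["name"])
                .replace("[PRONOUN]", r["pronoun"]))
    return (refilled == r["text"]
            and r["masked"].count("[PRONOUN]") == r["pronoun_count"])
```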

Examples:

| index | text | masked | label | name | pronoun | pronoun_count |
|-------|------|--------|-------|------|---------|---------------|
| 71130173 | jessica asked , not sounding as if she cared about the answer . | [NAME] asked , not sounding as if [PRONOUN] cared about the answer . | M | jessica | she | 1 |
| 17316262 | jeremy looked around and there were many people at the campsite ; then he looked down at the small keg . | [NAME] looked around and there were many people at the campsite ; then [PRONOUN] looked down at the small keg . | F | jeremy | he | 1 |
| 41606581 | tabitha did n't seem to notice as she swayed to the loud , thrashing music . | [NAME] did n't seem to notice as [PRONOUN] swayed to the loud , thrashing music . | M | tabitha | she | 1 |
| 52926749 | gerald could come in now , have a look if he wanted . | [NAME] could come in now , have a look if [PRONOUN] wanted . | F | gerald | he | 1 |
| 47875293 | chapter six as time went by , matthew found that he was no longer certain that he cared for journalism . | chapter six as time went by , [NAME] found that [PRONOUN] was no longer certain that [PRONOUN] cared for journalism . | F | matthew | he | 2 |
| 73605732 | liam tried to keep a straight face , but he could n't hold back a smile . | [NAME] tried to keep a straight face , but [PRONOUN] could n't hold back a smile . | F | liam | he | 1 |
| 31376791 | after all , ella was the same as she 'd always been . | after all , [NAME] was the same as [PRONOUN] 'd always been . | M | ella | she | 1 |
| 61942082 | seth shrugs as he hops off the bed and lands on the floor with a thud . | [NAME] shrugs as [PRONOUN] hops off the bed and lands on the floor with a thud . | F | seth | he | 1 |
| 68696573 | graham 's eyes meet mine , but i 'm sure there 's no way he remembers what he promised me several hours ago until he stands , stretching . | [NAME] 's eyes meet mine , but i 'm sure there 's no way [PRONOUN] remembers what [PRONOUN] promised me several hours ago until [PRONOUN] stands , stretching . | F | graham | he | 3 |
| 28923447 | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held caleb as he died . | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held [NAME] as [PRONOUN] died . | F | caleb | he | 1 |

Dataset Creation

Curation Rationale


For the training of a gender-bias GRADIEND model, a diverse dataset was needed that associates first names with both their factual and counterfactual pronouns, in order to assess gender-related gradient information.

Source Data


The dataset is derived from BookCorpus by filtering it and extracting the template structure.

BookCorpus is a dataset originally collected for academic purposes. The original dataset does not have an explicit license and was constructed by scraping publicly available books from the web. Use of this dataset is limited to **non-commercial research purposes only**. Proper attribution to the original BookCorpus creators is required.

We selected BookCorpus as the foundational dataset due to its focus on fictional narratives, in which characters are frequently referred to by their first names. In contrast, the English Wikipedia, also commonly used for training transformer models, was less suitable for our purposes: for instance, sentences like [NAME] Jackson was a musician, [PRONOUN] was a great singer may be biased towards the name Michael.

Data Collection and Processing


We filter the entries of BookCorpus and include only sentences that meet the following criteria:

  • Each sentence contains at least 50 characters.
  • Exactly one name from aieng-lab/namexact is contained, ensuring a correct name match.
  • No other names from a larger name dataset (aieng-lab/namextend) are included, ensuring that only a single name appears in the sentence.
  • The correct name's gender-specific third-person pronoun (he or she) is included at least once.
  • All occurrences of the pronoun appear after the name in the sentence.
  • The counterfactual pronoun does not appear in the sentence.
  • The sentence excludes gender-specific reflexive pronouns (himself, herself) and possessive pronouns (his, her, him, hers).
  • Gendered nouns (e.g., actor, actress, ...) are excluded, based on a gendered-word dataset with 2421 entries.
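The criteria above can be sketched as a single filter predicate. The word lists below are tiny toy stand-ins for aieng-lab/namexact, aieng-lab/namextend, and the 2421-entry gendered-word list, and the matching is simplified to whitespace tokens:

```python
# Toy stand-ins for the real name and gendered-word datasets.
EXACT_NAMES = {"jessica": "F", "liam": "M"}    # aieng-lab/namexact (name -> gender)
EXTENDED_NAMES = {"jessica", "liam", "seth"}   # aieng-lab/namextend
BLOCKED = {"himself", "herself", "his", "her", "him", "hers",
           "actor", "actress"}                 # reflexives, possessives, gendered nouns
PRONOUN = {"F": "she", "M": "he"}
COUNTERFACTUAL = {"F": "he", "M": "she"}

def keep(sentence: str) -> bool:
    """Apply the GENTER filtering criteria to one pre-tokenized sentence."""
    if len(sentence) < 50:
        return False
    tokens = sentence.split()
    names = [t for t in tokens if t in EXACT_NAMES]
    if len(names) != 1:                        # exactly one exact-match name
        return False
    name = names[0]
    if any(t in EXTENDED_NAMES and t != name for t in tokens):
        return False                           # no second name present
    gender = EXACT_NAMES[name]
    pron = PRONOUN[gender]
    if pron not in tokens:                     # correct pronoun at least once
        return False
    if COUNTERFACTUAL[gender] in tokens:       # no counterfactual pronoun
        return False
    first_pron = min(i for i, t in enumerate(tokens) if t == pron)
    if tokens.index(name) > first_pron:        # all pronouns follow the name
        return False
    if any(t in BLOCKED for t in tokens):      # no blocked gendered words
        return False
    return True
```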

This approach generated a total of 83772 sentences. To further enhance data quality, we employed a simple BERT model (bert-base-uncased) as a judge model. This model must predict the correct pronoun for selected names with high certainty; otherwise, sentences may contain noise or ambiguous terms not caught by the initial filtering. Specifically, we used 50 female and 50 male names from the aieng-lab/namextend train split, and a prediction counts as correct if the correct pronoun token is predicted as the token with the highest probability in the induced Masked Language Modeling (MLM) task. Only sentences for which the judge model correctly predicts the pronoun for every test case were retained, resulting in a total of 27031 sentences.
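The judge step can be sketched with the model abstracted behind a callable. Here predict_pronoun stands in for bert-base-uncased's top fill-mask prediction, and the two test names stand in for the 100 aieng-lab/namextend names; a template assumed to contain a single [PRONOUN] mask:

```python
# Toy stand-in for the 50 female + 50 male aieng-lab/namextend test names.
TEST_NAMES = {"jessica": "she", "liam": "he"}  # name -> expected pronoun

def passes_judge(masked: str, predict_pronoun) -> bool:
    """Keep a template only if the judge's top prediction is the correct
    pronoun for every test name (here: one [PRONOUN] mask per template)."""
    for name, expected in TEST_NAMES.items():
        # Turn the template into an MLM query for this name.
        query = masked.replace("[NAME]", name).replace("[PRONOUN]", "[MASK]")
        if predict_pronoun(query) != expected:
            return False
    return True
```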

The data is split into training (87.5%), validation (2.5%) and test (10%) subsets.

Bias, Risks, and Limitations


Since BookCorpus is entirely lower-cased, this dataset contains only lower-case sentences.

Statistics

  • Papers: 1
  • Benchmarks: 0

Links

  • Homepage

Tasks

  • Masked Language Modeling