Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


GENTER

GEnder Name TEmplates with pRonouns

Texts · cc-by-4.0 · Introduced 2025-02-03

This dataset consists of template sentences associating first names ([NAME]) with third-person singular pronouns ([PRONOUN]), e.g.,

[NAME] asked , not sounding as if [PRONOUN] cared about the answer .
after all , [NAME] was the same as [PRONOUN] 'd always been .
there were moments when [NAME] was soft , when [PRONOUN] seemed more like the person [PRONOUN] had been .

Usage

from datasets import load_dataset

genter = load_dataset('aieng-lab/genter', trust_remote_code=True, split=split)

Here, split can be either train, val, test, or all.

Dataset Details

Dataset Description


This dataset is a filtered version of BookCorpus that includes only sentences in which a first name is followed by its correct third-person singular pronoun (he/she).

From these sentences, masked template sentences are created with two template keys: [NAME] and [PRONOUN]. This design allows the dataset to generate diverse sentences by varying the names (e.g., using names from aieng-lab/namexact) and inserting the appropriate pronoun for each name.
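As an illustration, filling a template with a concrete name and its matching pronoun can be sketched as follows (the name-to-gender mapping below is a toy stand-in for aieng-lab/namexact, not the real data):

```python
# Toy gender lookup; in practice, names and genders come from aieng-lab/namexact.
PRONOUNS = {"F": "she", "M": "he"}

def fill_template(masked: str, name: str, gender: str) -> str:
    """Instantiate a masked GENTER template with a name and its pronoun."""
    pronoun = PRONOUNS[gender]
    # str.replace fills every occurrence, which also covers templates
    # with multiple [PRONOUN] masks.
    return masked.replace("[NAME]", name).replace("[PRONOUN]", pronoun)

template = "[NAME] asked , not sounding as if [PRONOUN] cared about the answer ."
filled = fill_template(template, "jessica", "F")
# -> 'jessica asked , not sounding as if she cared about the answer .'
```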

Dataset Sources

  • Repository: github.com/aieng-lab/gradiend
  • Original Data: BookCorpus

NOTE: This dataset is derived from BookCorpus, for which we do not have publication rights. Therefore, this repository only provides indices, names, and pronouns referring to GENTER entries within the BookCorpus dataset on Hugging Face. By using load_dataset('aieng-lab/genter', trust_remote_code=True, split='all'), both the indices and the full BookCorpus dataset are downloaded locally. The indices are then used to construct the GENTER dataset. The initial dataset generation takes a few minutes, but subsequent loads are cached for faster access.
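Conceptually, the index-based reconstruction works as sketched below. This is a toy illustration with naive string replacement; the actual loader script in aieng-lab/genter performs the reconstruction internally, with proper token-level matching:

```python
# Toy BookCorpus stand-in; the real loader downloads the full corpus.
bookcorpus = [
    "some unrelated sentence .",
    "jessica asked , not sounding as if she cared about the answer .",
    "another unrelated sentence .",
]

# Each stored GENTER entry: (index into BookCorpus, name, pronoun).
genter_meta = [(1, "jessica", "she")]

def reconstruct(corpus, meta):
    """Rebuild GENTER records (text + masked template) from stored indices."""
    records = []
    for index, name, pronoun in meta:
        text = corpus[index]
        # Naive replacement for illustration only; whole-token matching
        # is needed in practice to avoid masking substrings.
        masked = text.replace(name, "[NAME]").replace(pronoun, "[PRONOUN]")
        records.append({"text": text, "masked": masked,
                        "name": name, "pronoun": pronoun, "index": index})
    return records

records = reconstruct(bookcorpus, genter_meta)
```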

Dataset Structure

  • text: the original BookCorpus entry
  • masked: the masked version of text, i.e., with template masks for the name ([NAME]) and the pronoun ([PRONOUN])
  • label: the gender of the originally used name (F for female, M for male)
  • name: the original name in text that is masked in masked as [NAME]
  • pronoun: the original pronoun in text that is masked in masked as [PRONOUN] (he/she)
  • pronoun_count: the number of pronoun occurrences (typically 1, at most 4)
  • index: the index of text in BookCorpus
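The relationship between these fields can be checked mechanically. A minimal sketch with a toy record (only a subset of the fields is shown):

```python
# Toy record mirroring the documented field layout.
record = {
    "text": "after all , ella was the same as she 'd always been .",
    "masked": "after all , [NAME] was the same as [PRONOUN] 'd always been .",
    "name": "ella",
    "pronoun": "she",
    "pronoun_count": 1,
}

def is_consistent(r) -> bool:
    """Re-filling the template must reproduce the original text, and the
    pronoun count must match the number of [PRONOUN] masks."""
    refilled = (r["masked"]
                .replace("[NAME]", r["name"])
                .replace("[PRONOUN]", r["pronoun"]))
    return (refilled == r["text"]
            and r["masked"].count("[PRONOUN]") == r["pronoun_count"])
```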

Examples:

| index | text | masked | label | name | pronoun | pronoun_count |
|-------|------|--------|-------|------|---------|---------------|
| 71130173 | jessica asked , not sounding as if she cared about the answer . | [NAME] asked , not sounding as if [PRONOUN] cared about the answer . | M | jessica | she | 1 |
| 17316262 | jeremy looked around and there were many people at the campsite ; then he looked down at the small keg . | [NAME] looked around and there were many people at the campsite ; then [PRONOUN] looked down at the small keg . | F | jeremy | he | 1 |
| 41606581 | tabitha did n't seem to notice as she swayed to the loud , thrashing music . | [NAME] did n't seem to notice as [PRONOUN] swayed to the loud , thrashing music . | M | tabitha | she | 1 |
| 52926749 | gerald could come in now , have a look if he wanted . | [NAME] could come in now , have a look if [PRONOUN] wanted . | F | gerald | he | 1 |
| 47875293 | chapter six as time went by , matthew found that he was no longer certain that he cared for journalism . | chapter six as time went by , [NAME] found that [PRONOUN] was no longer certain that [PRONOUN] cared for journalism . | F | matthew | he | 2 |
| 73605732 | liam tried to keep a straight face , but he could n't hold back a smile . | [NAME] tried to keep a straight face , but [PRONOUN] could n't hold back a smile . | F | liam | he | 1 |
| 31376791 | after all , ella was the same as she 'd always been . | after all , [NAME] was the same as [PRONOUN] 'd always been . | M | ella | she | 1 |
| 61942082 | seth shrugs as he hops off the bed and lands on the floor with a thud . | [NAME] shrugs as [PRONOUN] hops off the bed and lands on the floor with a thud . | F | seth | he | 1 |
| 68696573 | graham 's eyes meet mine , but i 'm sure there 's no way he remembers what he promised me several hours ago until he stands , stretching . | [NAME] 's eyes meet mine , but i 'm sure there 's no way [PRONOUN] remembers what [PRONOUN] promised me several hours ago until [PRONOUN] stands , stretching . | F | graham | he | 3 |
| 28923447 | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held caleb as he died . | grief tore through me-the kind i had n't known would be possible to feel again , because i had felt this when i 'd held [NAME] as [PRONOUN] died . | F | caleb | he | 1 |

Dataset Creation

Curation Rationale


For the training of a gender-bias GRADIEND model, a diverse dataset was needed that associates first names with both their factual and counterfactual pronouns, in order to assess gender-related gradient information.

Source Data


The dataset is derived from BookCorpus by filtering it and extracting the template structure.

BookCorpus is a dataset originally collected for academic purposes. The original dataset does not have an explicit license and was constructed by scraping publicly available books from the web. Use of this dataset is limited to **non-commercial research purposes only**. Proper attribution to the original BookCorpus creators is required.

We selected BookCorpus as the foundational dataset due to its focus on fictional narratives, in which characters are frequently referred to by their first names. In contrast, the English Wikipedia, also commonly used for training transformer models, was less suitable for our purposes: for instance, sentences like [NAME] Jackson was a musician, [PRONOUN] was a great singer may be biased towards the name Michael.

Data Collection and Processing


We filter the entries of BookCorpus and include only sentences that meet the following criteria:

  • Each sentence contains at least 50 characters.
  • Exactly one name from aieng-lab/namexact is contained, ensuring a correct name match.
  • No other names from a larger name dataset (aieng-lab/namextend) are included, ensuring that only a single name appears in the sentence.
  • The correct name's gender-specific third-person pronoun (he or she) is included at least once.
  • All occurrences of the pronoun appear after the name in the sentence.
  • The counterfactual pronoun does not appear in the sentence.
  • The sentence excludes gender-specific reflexive pronouns (himself, herself) and possessive pronouns (his, her, him, hers).
  • Gendered nouns (e.g., actor, actress, ...) are excluded, based on a gendered-word dataset with 2421 entries.
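The criteria above can be sketched as a single filter predicate. The word lists below are tiny toy stand-ins for aieng-lab/namexact, aieng-lab/namextend, and the 2421-entry gendered-word list, and the matching is simplified to whitespace tokens:

```python
# Toy stand-ins for the real name and gendered-word datasets.
EXACT_NAMES = {"jessica": "F", "liam": "M"}    # aieng-lab/namexact (name -> gender)
EXTENDED_NAMES = {"jessica", "liam", "seth"}   # aieng-lab/namextend
BLOCKED = {"himself", "herself", "his", "her", "him", "hers",
           "actor", "actress"}                 # reflexives, possessives, gendered nouns
PRONOUN = {"F": "she", "M": "he"}
COUNTERFACTUAL = {"F": "he", "M": "she"}

def keep(sentence: str) -> bool:
    """Apply the GENTER filtering criteria to one pre-tokenized sentence."""
    if len(sentence) < 50:
        return False
    tokens = sentence.split()
    names = [t for t in tokens if t in EXACT_NAMES]
    if len(names) != 1:                        # exactly one exact-match name
        return False
    name = names[0]
    if any(t in EXTENDED_NAMES and t != name for t in tokens):
        return False                           # no second name present
    gender = EXACT_NAMES[name]
    pron = PRONOUN[gender]
    if pron not in tokens:                     # correct pronoun at least once
        return False
    if COUNTERFACTUAL[gender] in tokens:       # no counterfactual pronoun
        return False
    first_pron = min(i for i, t in enumerate(tokens) if t == pron)
    if tokens.index(name) > first_pron:        # all pronouns follow the name
        return False
    if any(t in BLOCKED for t in tokens):      # no blocked gendered words
        return False
    return True
```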

This approach generated a total of 83772 sentences. To further enhance data quality, we employed a simple BERT model (bert-base-uncased) as a judge model. This model must predict the correct pronoun for selected names with high certainty; otherwise, sentences may contain noise or ambiguous terms not caught by the initial filtering. Specifically, we used 50 female and 50 male names from the aieng-lab/namextend train split, and a prediction counts as correct if the correct pronoun token is predicted as the token with the highest probability in the induced Masked Language Modeling (MLM) task. Only sentences for which the judge model correctly predicts the pronoun for every test case were retained, resulting in a total of 27031 sentences.
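The judge step can be sketched with the model abstracted behind a callable. Here predict_pronoun stands in for bert-base-uncased's top fill-mask prediction, and the two test names stand in for the 100 aieng-lab/namextend names; a template assumed to contain a single [PRONOUN] mask:

```python
# Toy stand-in for the 50 female + 50 male aieng-lab/namextend test names.
TEST_NAMES = {"jessica": "she", "liam": "he"}  # name -> expected pronoun

def passes_judge(masked: str, predict_pronoun) -> bool:
    """Keep a template only if the judge's top prediction is the correct
    pronoun for every test name (here: one [PRONOUN] mask per template)."""
    for name, expected in TEST_NAMES.items():
        # Turn the template into an MLM query for this name.
        query = masked.replace("[NAME]", name).replace("[PRONOUN]", "[MASK]")
        if predict_pronoun(query) != expected:
            return False
    return True
```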

The data is split into training (87.5%), validation (2.5%) and test (10%) subsets.

Bias, Risks, and Limitations


Since BookCorpus is entirely lower-cased, this dataset contains only lower-case sentences.

Statistics

  • Papers: 1
  • Benchmarks: 0

Links

  • Homepage

Tasks

  • Masked Language Modeling