NAMEXTEND

Texts · cc-by-4.0 · Introduced 2025-02-03

This dataset extends NAMEXACT by including words that can be used as names, but may not exclusively be used as names in every context.

Dataset Details

Dataset Description

Unlike NAMEXACT, this dataset contains words that are mostly used as names but may also be used in other contexts, such as:

  • Christian (believer in Christianity)
  • Drew (simple past of the verb to draw)
  • Florence (an Italian city)
  • Henry (the SI unit of inductance)
  • Mercedes (a car brand)

In addition, names with ambiguous gender are included once for each gender. For instance, Skyler is included as a female (F) name with a probability of 37.3% and as a male (M) name with a probability of 62.7%.

Dataset Structure

  • name: the name
  • gender: the gender of the name (M for male and F for female)
  • count: the count value of this name (raw value from the original dataset)
  • probability: the probability of this name (raw value from original dataset; not normalized to this dataset!)
  • gender_agreement: a value describing how certain it is that this name has an unambiguous gender, computed as the maximum probability of that name across both genders, e.g., max(37.3%, 62.7%) = 62.7% for Skyler. For names with a unique gender in this dataset, this value is 1.0
  • primary_gender: equal to gender for names with a unique gender in this dataset; otherwise, the gender with the higher probability for that name
  • genders: label B if both genders are contained for this name in this dataset, otherwise equal to gender
  • prob_F: the probability of that name being used as a female name (i.e., 0.0 or 1.0 if genders != B)
  • prob_M: the probability of that name being used as a male name
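
The derived columns above can be illustrated with a short sketch. It is not the authors' processing code: it assumes that prob_F and prob_M are the per-gender shares of a name's total count, which matches the Skyler example (37.3% + 62.7% = 100%) but is not confirmed by this card. The function name `add_gender_columns`, the tie-breaking rule, and the counts in the sample rows are hypothetical.

```python
from collections import defaultdict


def add_gender_columns(rows):
    """Add prob_F, prob_M, gender_agreement, primary_gender, and genders.

    Assumption: the per-gender probabilities are count shares per name.
    """
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    seen = defaultdict(set)
    for row in rows:
        counts[row["name"]][row["gender"]] += row["count"]
        seen[row["name"]].add(row["gender"])

    result = []
    for row in rows:
        c = counts[row["name"]]
        total = c["F"] + c["M"]
        prob_f = c["F"] / total
        prob_m = c["M"] / total
        result.append({
            **row,
            "prob_F": prob_f,
            "prob_M": prob_m,
            # Maximum probability across both genders; 1.0 for a unique gender.
            "gender_agreement": max(prob_f, prob_m),
            # Ties fall back to "M" here; this choice is arbitrary.
            "primary_gender": "F" if prob_f > prob_m else "M",
            # "B" if the name appears with both genders, else its only gender.
            "genders": "B" if len(seen[row["name"]]) == 2 else row["gender"],
        })
    return result


# Hypothetical counts chosen to reproduce the 37.3% / 62.7% split.
rows = [
    {"name": "Skyler", "gender": "F", "count": 373},
    {"name": "Skyler", "gender": "M", "count": 627},
    {"name": "Henry", "gender": "M", "count": 500},
]
enriched = add_gender_columns(rows)
```

Under these assumptions, both Skyler rows get genders = B, primary_gender = M, and a gender_agreement of 0.627, while Henry gets prob_F = 0.0, prob_M = 1.0, and a gender_agreement of 1.0.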

Dataset Creation

Source Data

The data is created by filtering the Gender by Name dataset.

Data Collection and Processing

The original data is filtered to contain only names with a count of at least 100 to remove very rare names. This threshold reduces the total number of names by 72%, from 133910 to 37425.
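
The filtering step and the reported reduction can be checked with a minimal sketch. It assumes the threshold is applied to each (name, gender) row individually, which the card does not state explicitly; the helper names and the sample rows are hypothetical.

```python
MIN_COUNT = 100  # threshold stated in this card


def filter_rare_names(rows, min_count=MIN_COUNT):
    """Keep only rows whose count is at least min_count."""
    return [row for row in rows if row["count"] >= min_count]


def relative_reduction(before, after):
    """Fraction of entries removed by the filter."""
    return 1 - after / before


# Hypothetical sample rows, only to show the filter.
sample = [
    {"name": "Henry", "gender": "M", "count": 500},
    {"name": "Zyx", "gender": "F", "count": 7},
]
kept = filter_rare_names(sample)

# Totals reported in this card.
reduction = relative_reduction(133910, 37425)
print(f"{reduction:.1%}")  # 72.1%
```

The computed reduction of about 72.1% agrees with the rounded 72% given above.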

Bias, Risks, and Limitations

The original dataset provides counts of names (with their gender) for male and female babies from open-source government authorities in the US (1880-2019), UK (2011-2018), Canada (2011-2018), and Australia (1944-2019).