NAMEXTEND
This dataset extends NAMEXACT by including words that can be used as names, but may not exclusively be used as names in every context.
Dataset Details
Dataset Description
Unlike NAMEXACT, this dataset contains words that are mostly used as names but may also be used in other contexts, such as:
- Christian (believer in Christianity)
- Drew (simple past of the verb to draw)
- Florence (an Italian city)
- Henry (the SI unit of inductance)
- Mercedes (a car brand)
In addition, names with ambiguous gender are included once for each gender. For instance, Skyler is included as a female (F) name with a probability of 37.3% and as a male (M) name with a probability of 62.7%.
Dataset Sources
- Repository: github.com/aieng-lab/gradiend
- Original Dataset: Gender by Name
Dataset Structure
- `name`: the name
- `gender`: the gender of the name (`M` for male and `F` for female)
- `count`: the count value of this name (raw value from the original dataset)
- `probability`: the probability of this name (raw value from the original dataset; not normalized to this dataset!)
- `gender_agreement`: a value describing the certainty that this name has an unambiguous gender, computed as the maximum probability of that name across both genders, e.g., max(37.3%, 62.7%) = 62.7% for Skyler. For names with a unique `gender` in this dataset, this value is 1.0
- `primary_gender`: equal to `gender` for names with a unique gender in this dataset; otherwise, the gender with the higher probability for that name
- `genders`: the label `B` if both genders are contained for this name in this dataset, otherwise equal to `gender`
- `prob_F`: the probability of that name being used as a female name (i.e., 0.0 or 1.0 if `genders` != `B`)
- `prob_M`: the probability of that name being used as a male name
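The relationship between the derived fields can be illustrated with a minimal Python sketch. It assumes that `prob_F` and `prob_M` are the per-name shares of the raw `count` values; the card does not state how they were actually computed, so this is an illustration rather than the authors' procedure. The function name `add_gender_fields` and the sample counts are hypothetical, with the Skyler counts chosen only to reproduce the 37.3% / 62.7% split mentioned above.

```python
from collections import defaultdict


def add_gender_fields(rows):
    """Return copies of the rows extended with the derived gender fields.

    Assumption: per-name probabilities are count shares across F and M.
    """
    # Sum the counts per name and gender.
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for row in rows:
        counts[row["name"]][row["gender"]] += row["count"]

    enriched = []
    for row in rows:
        c = counts[row["name"]]
        total = c["F"] + c["M"]
        prob_f = c["F"] / total
        prob_m = c["M"] / total
        has_both = c["F"] > 0 and c["M"] > 0
        enriched.append({
            **row,
            "prob_F": prob_f,
            "prob_M": prob_m,
            # Maximum probability across both genders; 1.0 for unique names.
            "gender_agreement": max(prob_f, prob_m),
            # Ties fall to "M" here; this tie-break is an arbitrary choice.
            "primary_gender": "F" if prob_f > prob_m else "M",
            "genders": "B" if has_both else row["gender"],
        })
    return enriched


# Hypothetical sample rows (counts are made up for illustration).
rows = [
    {"name": "Skyler", "gender": "F", "count": 373},
    {"name": "Skyler", "gender": "M", "count": 627},
    {"name": "Henry", "gender": "M", "count": 1000},
]
result = add_gender_fields(rows)
```

With these sample rows, both Skyler entries receive `genders` equal to `B` and `primary_gender` equal to `M`, while Henry keeps `genders` equal to `M` with a `gender_agreement` of 1.0.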
Dataset Creation
Source Data
The data is created by filtering the Gender by Name dataset.
Data Collection and Processing
The original data is filtered to contain only names with a count of at least 100, which removes very rare names. This threshold reduces the total number of names by 72%, from 133910 to 37425.
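The filtering step and the reported reduction can be checked with a short Python sketch. It assumes the threshold is applied to each row's `count` value independently; whether the original processing filtered per row or per name is not stated here. The helper names `filter_rare_names` and `reduction_percent` and the sample rows are hypothetical.

```python
MIN_COUNT = 100  # threshold stated in the text


def filter_rare_names(rows, min_count=MIN_COUNT):
    """Keep only rows whose count is at least min_count."""
    return [row for row in rows if row["count"] >= min_count]


def reduction_percent(before, after):
    """Percentage of entries removed when going from before to after."""
    return 100 * (before - after) / before


# Hypothetical sample rows (counts are made up for illustration).
sample = [
    {"name": "Henry", "gender": "M", "count": 5000},
    {"name": "Rarename", "gender": "F", "count": 12},
    {"name": "Florence", "gender": "F", "count": 100},
]
kept = filter_rare_names(sample)

# Reduction reported in the text: from 133910 to 37425 names.
reported = reduction_percent(133910, 37425)
print(f"{reported:.1f}%")
```

The computed reduction is roughly 72%, consistent with the figure given above.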
Bias, Risks, and Limitations
The original dataset provides counts of names (with their gender) for male and female babies, taken from open-source government authorities in the US (1880-2019), UK (2011-2018), Canada (2011-2018), and Australia (1944-2019).