This dataset extends NAMEXACT by including words that can be used as names, but may not exclusively be used as names in every context.

Dataset Details

Dataset Description

Unlike NAMEXACT, this dataset contains words that are mostly used as names but may also be used in other contexts, such as:

Christian (believer in Christianity)
Drew (simple past of the verb to draw)
Florence (an Italian city)
Henry (the SI unit of inductance)
Mercedes (a car brand)

In addition, names with ambiguous gender are included once for each gender. For instance, Skyler is included as a female (F) name with a probability of 37.3% and as a male (M) name with a probability of 62.7%.

Dataset Sources

Repository: github.com/aieng-lab/gradiend
Original Dataset: Gender by Name

Dataset Structure

name: the name
gender: the gender of the name (M for male and F for female)
count: the count value of this name (raw value from the original dataset)
probability: the probability of this name (raw value from the original dataset; not normalized to this dataset!)
gender_agreement: a value describing the certainty that this name has an unambiguous gender, computed as the maximum probability of that name across both genders, e.g., $\max(37.3\%, 62.7\%) = 62.7\%$ for Skyler. For names with a unique gender in this dataset, this value is 1.0
primary_gender: equal to gender for names with a unique gender in this dataset; otherwise, the gender with the higher probability for that name
genders: label B if both genders are contained for this name in this dataset, otherwise equal to gender
prob_F: the probability of that name being used as a female name (i.e., 0.0 or 1.0 if genders != B)
prob_M: the probability of that name being used as a male name
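
The derived columns above can be illustrated with a minimal sketch. It assumes that prob_F and prob_M are the per-gender shares of a name's total count; the card does not state how these values were actually computed, so this is an illustration rather than the dataset's own pipeline. The function name add_derived_fields and the counts in the example rows are hypothetical, chosen only so that Skyler reproduces the 37.3% / 62.7% split mentioned earlier.

```python
from collections import defaultdict


def add_derived_fields(rows):
    """Attach the derived columns to each (name, gender) row.

    Assumption: prob_F / prob_M are shares of the name's total count.
    """
    # Collect the per-gender counts of every name.
    counts = defaultdict(dict)
    for row in rows:
        counts[row["name"]][row["gender"]] = row["count"]

    result = []
    for row in rows:
        per_gender = counts[row["name"]]
        total = sum(per_gender.values())
        prob_f = per_gender.get("F", 0) / total
        prob_m = per_gender.get("M", 0) / total
        result.append({
            **row,
            "gender_agreement": max(prob_f, prob_m),
            # Ties fall back to "M" here; this tie-break is arbitrary.
            "primary_gender": "F" if prob_f > prob_m else "M",
            "genders": "B" if len(per_gender) == 2 else row["gender"],
            "prob_F": prob_f,
            "prob_M": prob_m,
        })
    return result


rows = [
    {"name": "Skyler", "gender": "F", "count": 373},    # hypothetical count
    {"name": "Skyler", "gender": "M", "count": 627},    # hypothetical count
    {"name": "Florence", "gender": "F", "count": 500},  # hypothetical count
]
derived = add_derived_fields(rows)
```

With these example rows, both Skyler entries receive genders B and primary_gender M, while Florence keeps genders F and a gender_agreement of 1.0.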

Dataset Creation

Source Data

The data is created by filtering Gender by Name.

Data Collection and Processing

The original data is filtered to contain only names with a count of at least 100, which removes very rare names. This threshold reduces the total number of names by 72%, from 133910 to 37425.
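
The filtering step could be sketched as follows. This assumes the threshold is applied to each row's count value; whether the original processing filtered per row or per name is not stated here. The names MIN_COUNT and filter_rare_names and the sample rows are hypothetical. The last lines recompute the reported reduction from the two totals given above.

```python
MIN_COUNT = 100  # threshold stated in the text


def filter_rare_names(rows, min_count=MIN_COUNT):
    """Keep only rows whose count reaches the threshold (assumed per row)."""
    return [row for row in rows if row["count"] >= min_count]


sample = [
    {"name": "Henry", "gender": "M", "count": 250},  # hypothetical count
    {"name": "Zyx", "gender": "F", "count": 7},      # hypothetical rare name
]
kept = filter_rare_names(sample)

# Relative reduction implied by the totals reported above.
names_before, names_after = 133910, 37425
reduction = 1 - names_after / names_before
reduction_percent = round(reduction * 100)
```

The recomputed reduction rounds to the 72% figure quoted in the text.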

Bias, Risks, and Limitations

The original dataset provides counts of names (with their gender) for male and female babies, collected from open-source government authorities in the US (1880-2019), UK (2011-2018), Canada (2011-2018), and Australia (1944-2019).