Titanic

Titanic - Machine Learning from Disaster

Tabular · CC BY-SA 3.0 · Introduced 2012-09-01

Titanic Dataset Description

Overview

The data is divided into two groups:

  • Training set (train.csv):
    Used to build machine learning models. It includes the outcome (also called the "ground truth") for each passenger, so a model can learn to predict survival from "features" such as sex and ticket class. New features can also be derived through feature engineering.
  • Test set (test.csv):
    Used to evaluate model performance on unseen data. The ground truth is not provided; the task is to predict survival for each passenger in the test set using the trained model.

Additionally, gender_submission.csv is provided as an example submission file. It contains predictions that assume every female passenger, and no male passenger, survived.
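
The rule behind the example submission can be sketched in a few lines of Python. This is only an illustration, not the code used to produce gender_submission.csv: the header names PassengerId, Sex, and Survived are assumptions about the CSV files (the data dictionary lists the variables in lowercase), and the three sample rows are made up rather than taken from test.csv.

```python
import csv
import io

# Made-up stand-in for test.csv; the column names are assumed, and the
# real file has more columns and many more rows.
SAMPLE_TEST_CSV = """PassengerId,Sex
1,male
2,female
3,male
"""


def gender_baseline(test_csv_text):
    """Return (PassengerId, Survived) pairs: 1 for females, 0 otherwise."""
    reader = csv.DictReader(io.StringIO(test_csv_text))
    return [
        (row["PassengerId"], 1 if row["Sex"] == "female" else 0)
        for row in reader
    ]


def to_submission_csv(rows):
    """Serialize the predictions as CSV text with a two-column header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(rows)
    return out.getvalue()


predictions = gender_baseline(SAMPLE_TEST_CSV)
submission = to_submission_csv(predictions)
```

A baseline like this uses no training data at all, which makes it a convenient reference point when judging whether a trained model adds anything.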


Data Dictionary

| Variable | Definition                               | Key                                            |
|----------|------------------------------------------|------------------------------------------------|
| survival | Survival                                 | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                             | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                      |                                                |
| age      | Age in years                             |                                                |
| sibsp    | # of siblings/spouses aboard the Titanic |                                                |
| parch    | # of parents/children aboard the Titanic |                                                |
| ticket   | Ticket number                            |                                                |
| fare     | Passenger fare                           |                                                |
| cabin    | Cabin number                             |                                                |
| embarked | Port of Embarkation                      | C = Cherbourg, Q = Queenstown, S = Southampton |
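
The Key column above translates directly into lookup tables. The following is a minimal sketch that decodes the coded fields of one record. The function name decode_record and the dict-based record layout are hypothetical, chosen for illustration, and the field names follow the lowercase names in the data dictionary, which may differ from the headers in the CSV files.

```python
# Lookup tables copied from the Key column of the data dictionary.
SURVIVAL_KEY = {0: "No", 1: "Yes"}
PCLASS_KEY = {1: "1st", 2: "2nd", 3: "3rd"}
EMBARKED_KEY = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

CODED_FIELDS = (
    ("survival", SURVIVAL_KEY),
    ("pclass", PCLASS_KEY),
    ("embarked", EMBARKED_KEY),
)


def decode_record(record):
    """Return a copy of record with coded values replaced by their labels.

    Fields without a key, and values missing from a key, are left unchanged.
    """
    decoded = dict(record)
    for field, key in CODED_FIELDS:
        if field in decoded:
            decoded[field] = key.get(decoded[field], decoded[field])
    return decoded


# Illustrative record, not a row from the dataset.
example = decode_record(
    {"survival": 1, "pclass": 3, "sex": "female", "embarked": "Q"}
)
```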


Variable Notes

  • pclass: Proxy for socio-economic status (SES):
    • 1st = Upper
    • 2nd = Middle
    • 3rd = Lower
  • age:
    • Fractional if less than 1 year.
    • Estimated ages are represented in the form xx.5.
  • sibsp: Defines family relations as:
    • Sibling: Brother, sister, stepbrother, stepsister.
    • Spouse: Husband, wife (excluding mistresses and fiancés).
  • parch: Defines family relations as:
    • Parent: Mother, father.
    • Child: Daughter, son, stepdaughter, stepson.
      Some children traveled only with a nanny, so parch = 0 for them.
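
The notes above suggest a few simple derived features. The sketch below is one possible reading of them, not part of the dataset: the function name describe_passenger and the output field names are hypothetical, passing None for a missing age is an assumption, and an age below 1 is treated as an infant age rather than as an estimate.

```python
# SES labels from the pclass note.
SES_BY_PCLASS = {1: "Upper", 2: "Middle", 3: "Lower"}


def describe_passenger(pclass, age, sibsp, parch):
    """Derive simple features from the variable notes.

    age may be None when unknown (an assumption of this sketch).
    """
    # Ages below 1 year are recorded as fractions.
    is_infant = age is not None and age < 1
    # Estimated ages have the form xx.5; ages below 1 are excluded so an
    # infant age such as 0.5 is not mistaken for an estimate.
    age_estimated = age is not None and age >= 1 and age % 1 == 0.5
    return {
        "ses": SES_BY_PCLASS[pclass],
        "is_infant": is_infant,
        "age_estimated": age_estimated,
        # Siblings/spouses plus parents/children aboard.
        "relatives_aboard": sibsp + parch,
    }


# Illustrative inputs, not rows from the dataset.
adult = describe_passenger(pclass=3, age=28.5, sibsp=0, parch=0)
infant = describe_passenger(pclass=1, age=0.75, sibsp=1, parch=2)
```

Because some children traveled only with a nanny, relatives_aboard == 0 does not by itself mean the passenger traveled alone.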