To test whether Title is a significant predictor of survival:
The variable “Title” contains 17 levels, most of which have very low frequencies – see table below.
The first step is to consolidate all low-frequency levels into a single “Other” level.
This gives us the following frequencies:
- Other – 25
- Master – 40
- Miss – 184
- Mr – 517
- Mrs – 125
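The consolidation step can be sketched roughly as follows. This is a minimal illustration on a made-up vector, not the code actually used: the names `title`, `keep` and `title_grouped` are hypothetical, and the choice of which levels to keep is inferred from the frequency list above.

```r
# Toy stand-in for the Title column (the real variable has 17 levels).
title <- c("Mr", "Mrs", "Miss", "Master", "Mr", "Dr", "Rev", "Miss", "Col", "Mr")

# Levels kept as they are; anything else counts as low-frequency.
# This list is an assumption based on the consolidated frequencies.
keep <- c("Master", "Miss", "Mr", "Mrs")

# Replace every low-frequency title with "Other".
title_grouped <- ifelse(title %in% keep, title, "Other")
title_grouped <- factor(title_grouped,
                        levels = c("Other", "Master", "Miss", "Mr", "Mrs"))

# Frequency table of the consolidated variable.
table(title_grouped)
```

On the real data, the same `table()` call should give the five counts listed above if the grouping matches.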
We then run a binary logistic regression using only Title. The consolidated variable is replaced by one dummy variable per level, with Mrs left out as the reference level, and these dummy variables are the predictors.
Only one variable was not significant in the regression:
So the conclusion is that Title is a significant predictor of survival.
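A Title-only regression of this kind might look like the sketch below. It runs on synthetic data: the survival probabilities, the sample size and the data frame name `toy` are all invented for illustration and do not reproduce the results above. Treating Title as a factor with Mrs as the reference level makes `glm` create the dummy variables automatically. The final `anova` call is an extra overall test, not something taken from the original analysis.

```r
set.seed(1)
n <- 400

# Synthetic titles, drawn roughly in proportion to the consolidated counts.
title <- factor(sample(c("Other", "Master", "Miss", "Mr", "Mrs"), n,
                       replace = TRUE,
                       prob = c(0.03, 0.05, 0.20, 0.58, 0.14)))

# Make Mrs the reference level, so it gets no dummy variable.
title <- relevel(title, ref = "Mrs")

# Made-up survival probabilities per title (illustrative only).
p <- c(Mrs = 0.8, Master = 0.6, Miss = 0.7, Mr = 0.15, Other = 0.35)
survived <- rbinom(n, 1, p[as.character(title)])
toy <- data.frame(survived, title)

# Binary logistic regression with Title as the only predictor.
fit <- glm(survived ~ title, family = binomial(), data = toy)

# Wald z-test for each dummy variable (each level compared with Mrs).
summary(fit)$coefficients

# Likelihood-ratio test for Title as a whole.
anova(fit, test = "Chisq")
```

Each coefficient compares one level against Mrs, so a non-significant dummy only means that level is not distinguishable from Mrs, not that Title as a whole is uninformative.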
The second test was to submit an entry to Kaggle using my best-performing model with Title excluded:
model <- glm(formula = survived ~ male + pclass + fare + fare_per_person + age_class.interaction + sex_class + combined_age + family + age_squared + age_class_squared, family = binomial(), data = train)
This scored 0.77512, well below my current best score of 0.80861.