To consider whether Title is a significant predictor of “survival”:
The variable “Title” contains 17 levels, most of which have very low frequencies (see table below).
The first step is to consolidate all levels with low frequencies into an “Other” level.
This gives us the following frequencies:
- Other: 25
- Master: 40
- Miss: 184
- Mr: 517
- Mrs: 125
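The consolidation step can be sketched in base R roughly as follows. This is only an illustration: the toy vector of titles, the helper name consolidate_levels and the min_count cutoff are all assumptions of mine, not the data or threshold actually used for the frequencies above.

```r
# Toy titles standing in for the real Title column (made-up data).
titles <- c(rep("Mr", 6), rep("Miss", 4), rep("Mrs", 3), rep("Master", 2),
            "Dr", "Rev", "Col")

# Hypothetical helper: keep levels seen at least `min_count` times and
# relabel every other value as "Other".
consolidate_levels <- function(x, min_count) {
  x <- as.character(x)                       # also works if x is a factor
  counts <- table(x)
  keep <- names(counts)[counts >= min_count]
  factor(ifelse(x %in% keep, x, "Other"))
}

# The cutoff of 2 is arbitrary and chosen only to suit the toy data.
title_grouped <- consolidate_levels(titles, min_count = 2)
print(table(title_grouped))
```

On the real data the cutoff would need to be chosen so that only the four common titles survive and everything else falls into “Other”.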
Then we run a binary logistic regression using just Title, although the consolidated variable is now replaced by one dummy variable per level as the predictors (excluding Mrs, which acts as the reference level).
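A minimal sketch of a Title-only regression of this kind is shown below. The level counts are the ones listed above, but the survival outcomes are simulated with made-up probabilities, and the names title, train_toy and fit are hypothetical, so the printed coefficients say nothing about the real result.

```r
set.seed(1)

# Title levels repeated with the frequencies listed in the post.
title <- factor(rep(c("Other", "Master", "Miss", "Mr", "Mrs"),
                    times = c(25, 40, 184, 517, 125)))

# Made-up survival probabilities per title, for illustration only.
p <- c(Other = 0.4, Master = 0.6, Miss = 0.7, Mr = 0.15, Mrs = 0.8)

train_toy <- data.frame(
  survived = rbinom(length(title), 1, p[as.character(title)]),
  title    = relevel(title, ref = "Mrs")  # Mrs becomes the omitted baseline
)

# glm() expands the factor into one dummy per non-reference level,
# which matches building the dummy variables by hand and leaving out Mrs.
fit <- glm(survived ~ title, family = binomial(), data = train_toy)
print(summary(fit)$coefficients)
```

Each row of the coefficient table other than the intercept compares one title against Mrs, so the p-values there are the per-dummy significance tests referred to in the text.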
Only one variable was not significant in the regression:
So the conclusion is that Title is a significant predictor of survival.
The second test was to submit an entry to Kaggle using the best-performing model excluding Title:
model <- glm(formula = survived ~ male + pclass + fare + fare_per_person +
               age_class.interaction + sex_class + combined_age + family +
               age_squared + age_class_squared,
             family = binomial(), data = train)
This scored 0.77512, well below my current best score of 0.80861.