Submission 3 was supposed to improve on submission 2, where I had used a logistic regression to produce predictions using age, gender, pclass and fare as predictors. That model produced missing predictions wherever age was missing. In submission 2, I simply replaced the missing predictions with zeros - this made sense, as a majority of Titanic passengers did not survive.
Submission 3 used a secondary model to generate predictions where the main model produced a missing prediction. The secondary model was a logistic regression with gender, pclass, and fare.
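To make the difference between the two submissions concrete, here is a minimal sketch of the two ways of filling the gaps. It is not my actual code: the function names and the toy prediction lists are made up for illustration, and a missing prediction is represented as `None`.

```python
from typing import List, Optional, Sequence


def fill_with_constant(primary: Sequence[Optional[int]], fill_value: int = 0) -> List[int]:
    """Submission 2 style: replace every missing prediction with a constant."""
    return [fill_value if p is None else p for p in primary]


def fill_with_fallback(primary: Sequence[Optional[int]], secondary: Sequence[int]) -> List[int]:
    """Submission 3 style: use the secondary model only where the primary is missing."""
    if len(primary) != len(secondary):
        raise ValueError("primary and secondary must have the same length")
    return [s if p is None else p for p, s in zip(primary, secondary)]


# Toy data (hypothetical): None marks passengers whose age was missing,
# so the primary model could not score them.
primary = [1, None, 0, None, 1]
secondary = [1, 1, 0, 0, 1]  # the secondary model does not need age

submission_2 = fill_with_constant(primary)
submission_3 = fill_with_fallback(primary, secondary)

print(submission_2)  # [1, 0, 0, 0, 1]
print(submission_3)  # [1, 1, 0, 0, 1]
```

The two submissions can only differ on rows where the primary prediction is missing and the secondary model predicts survival.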
I'm surprised that this "thoughtful" model didn't outperform the somewhat arbitrary model from submission 2.
The Kaggle scores were 0.74641 (submission 3) and 0.76555 (submission 2).
Other miscellaneous observations:
- the test set has one missing value for fare. The training set may also have had missing values, in the cases where the fare was 0.
- the primary and secondary models mentioned above generated very similar predictions - excluding missing predictions, there were only 28 (out of 418 cases) where the predictions differed.
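The comparison in the last observation can be sketched as follows. This is an illustration rather than the code I ran: `count_disagreements` is a made-up helper, the data is a toy example, and missing predictions are again assumed to be `None`.

```python
from typing import Optional, Sequence, Tuple


def count_disagreements(
    primary: Sequence[Optional[int]], secondary: Sequence[Optional[int]]
) -> Tuple[int, int]:
    """Return (differing, compared), skipping rows where either prediction is missing."""
    differing = 0
    compared = 0
    for p, s in zip(primary, secondary):
        if p is None or s is None:
            continue  # excluded from the comparison
        compared += 1
        if p != s:
            differing += 1
    return differing, compared


# Toy data (hypothetical), not the real 418 test-set rows.
primary = [1, None, 0, 1, 0, None]
secondary = [1, 1, 1, 1, 0, 0]

differing, compared = count_disagreements(primary, secondary)
print(f"{differing} of {compared} comparable predictions differ")
```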
Where to next? Firstly, I need to spend some more time producing a full training data set, with all the indicator and/or generated variables that I consider worth working with. That way I can build a more nuanced logistic regression model. I also need to understand logistic regression in more detail. For example, in my current models, I've used prob = 0.5 as the threshold for predicting survival versus non-survival. I need to see whether altering the threshold improves the predictions. Maybe other flavors of logistic regression will produce a better result.
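As a rough sketch of the threshold experiment, the idea is to take predicted probabilities for passengers whose outcome is known (for example, a held-out part of the training set) and measure accuracy at several cut-offs. The probabilities, labels and candidate thresholds below are invented for illustration, and I'm assuming accuracy is the measure to optimize.

```python
from typing import Dict, Sequence


def accuracy_at_threshold(probs: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Predict survival (1) when prob >= threshold, then return the share of correct predictions."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(1 for pred, y in zip(preds, labels) if pred == y)
    return correct / len(labels)


# Hypothetical predicted survival probabilities and known outcomes.
probs = [0.10, 0.35, 0.45, 0.55, 0.62, 0.80]
labels = [0, 0, 1, 0, 1, 1]

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
results: Dict[float, float] = {t: accuracy_at_threshold(probs, labels, t) for t in thresholds}

for t, acc in results.items():
    print(f"threshold {t:.1f}: accuracy {acc:.3f}")

# On ties, max() keeps the first threshold in the list.
best = max(results, key=results.get)
print("best threshold on this toy data:", best)
```

A threshold chosen this way should be checked on data the model has not seen, otherwise the gain may not carry over to the Kaggle test set.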