I've used a regression model to attempt to predict age more accurately.
Using a regression model with indicator variables for the different titles, we achieve an R-squared of 28.6%.
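As a rough illustration of what "indicator variables for the titles" means, here is a minimal NumPy sketch. The titles and ages in it are made-up toy values, not the actual data, and the variable names are my own. One title level is used as the baseline, and each remaining level gets a 0/1 column.

```python
import numpy as np

# Hypothetical toy data: these titles and ages are invented for illustration.
titles = np.array(["Mr", "Mrs", "Miss", "Master", "Mr",
                   "Miss", "Master", "Mrs", "Mr", "Miss"])
ages = np.array([35.0, 40.0, 22.0, 4.0, 28.0,
                 18.0, 6.0, 45.0, 50.0, 30.0])

# One level is the baseline (absorbed by the constant); the rest get 0/1 columns.
levels = np.unique(titles)
indicator_levels = levels[1:]
dummies = np.column_stack([(titles == lv).astype(float) for lv in indicator_levels])

# Ordinary least squares with an intercept.
A = np.column_stack([np.ones(len(ages)), dummies])
beta, *_ = np.linalg.lstsq(A, ages, rcond=None)
fitted = A @ beta

# R-squared: share of the variance in age explained by the title indicators.
ss_res = np.sum((ages - fitted) ** 2)
ss_tot = np.sum((ages - ages.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(f"R-squared: {r_squared:.3f}")
```

With only indicator variables in the model, the fitted age for each passenger is simply the mean age of everyone sharing that title, which is why adding further variables can raise R-squared.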
If we include basically all other variables in the regression, R-squared increases to 43.7%.
Using backward elimination, a model with fewer variables had an R-squared of 43.3%.
It will be interesting to see if and how this improves the predictions of a logistic regression model.
The final model was:
Predictors: (Constant), fare_per_person, cabin_G, cabin_F, Embarked_Q, Title_Other, Title_Miss, cabin_Y, Title_Master, Embarked_C, sibsp, male, fare, pclass, Title_Mr
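For anyone wanting to reproduce the backward elimination step outside a stats package, here is a minimal sketch. It is not the exact procedure I ran: the removal rule (drop the predictor with the smallest |t| until all remaining ones reach a threshold of 2) is an assumption standing in for whatever removal criterion your software uses, and the data and column names (`x0`, `x1`, `noise_a`, `noise_b`) are synthetic.

```python
import numpy as np

def fit_ols(X, y):
    """Fit OLS with an intercept; return coefficients, t-statistics, R-squared."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    t_stats = beta / np.sqrt(np.diag(cov))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, t_stats, r2

def backward_eliminate(X, y, names, t_min=2.0):
    """Repeatedly drop the weakest predictor until every |t| is at least t_min.

    t_min=2.0 is an assumed threshold, roughly a 5% two-sided test.
    """
    keep = list(range(X.shape[1]))
    while keep:
        _, t_stats, _ = fit_ols(X[:, keep], y)
        abs_t = np.abs(t_stats[1:])        # skip the intercept
        weakest = int(np.argmin(abs_t))
        if abs_t[weakest] >= t_min:
            break                          # everything left is significant
        keep.pop(weakest)
    _, _, r2 = fit_ols(X[:, keep], y)
    return [names[i] for i in keep], r2

# Synthetic example: two real predictors and two pure-noise columns.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
y = 30 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=2.0, size=n)
names = ["x0", "x1", "noise_a", "noise_b"]

_, _, r2_full = fit_ols(X, y)
selected, r2_reduced = backward_eliminate(X, y, names)
print(selected, round(r2_full, 3), round(r2_reduced, 3))
```

Dropping predictors can never increase R-squared on the same data, so a small drop like the one from 43.7% to 43.3% is what you would expect when the removed variables carry little information.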
I will do some more work on the residuals and other statistics. For example, the "all variables" model had a maximum Mahalanobis distance of 712; under the simpler model produced by backward elimination, the Mahalanobis distance for that same case dropped to 6.7, and the maximum was now 218.
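For reference, the Mahalanobis distance measures how far each case's predictor values lie from the centroid of all cases, taking the correlations between predictors into account. Here is a minimal sketch on synthetic data. I am assuming the squared distance, which is what many packages report; whether the figures above are squared depends on the software. The function name `mahalanobis_sq` is my own.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row from the column means."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))   # sample covariance (ddof=1)
    inv_cov = np.linalg.pinv(cov)                  # pinv tolerates collinear columns
    # Row-wise quadratic form: centered_i @ inv_cov @ centered_i
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

# Synthetic predictors with one deliberately extreme case in row 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[0] = [8.0, -8.0, 8.0]

d2 = mahalanobis_sq(X)
worst = int(np.argmax(d2))
print(worst, round(float(d2[worst]), 1))
```

Because the distance depends on which predictors are in the model, the same case can look extreme under one model and unremarkable under another, which is consistent with the drop from 712 to 6.7 described above.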