Sunday, April 21, 2013

Titanic Data Competition - Predicting Age Where Age Is Missing

Age is a strong predictor of survival, so with a relatively large number of missing age values it makes sense to predict each missing age as accurately as possible. One suggestion on the competition discussion forum was to fill in missing ages with the average age for the passenger's title. For example, the average age for people with the title "Miss" was 21.00.
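As a rough illustration of the forum suggestion, here is a minimal sketch of title-based mean imputation. The records and the function name `impute_age_by_title` are made up for this example (the toy "Miss" ages are simply chosen to average 21); this is not the competition data.

```python
from collections import defaultdict

# Hypothetical toy records: (title, age); None marks a missing age.
passengers = [
    ("Miss", 18.0),
    ("Miss", 24.0),
    ("Miss", None),
    ("Mr", 30.0),
    ("Mr", 40.0),
    ("Mr", None),
]


def impute_age_by_title(records):
    """Replace each missing age with the mean known age for the same title."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for title, age in records:
        if age is not None:
            totals[title] += age
            counts[title] += 1
    means = {title: totals[title] / counts[title] for title in counts}
    # A title with no known ages at all keeps None rather than a guessed value.
    return [
        (title, age if age is not None else means.get(title))
        for title, age in records
    ]


filled = impute_age_by_title(passengers)
```

Every passenger with the same title gets the same imputed age, which is the limitation a regression with more predictors tries to get around.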

I've used a regression model to attempt to predict age more accurately.

Using a regression model with indicator variables for the different titles, we achieve an R-squared of 28.6%.
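A minimal sketch of this kind of model, assuming ordinary least squares fitted with NumPy on made-up data. The titles, ages, choice of baseline and the helper name `fit_title_regression` are all illustrative assumptions, so the R-squared it produces has nothing to do with the 28.6% above.

```python
import numpy as np

# Hypothetical toy data, not the competition file.
titles = np.array(["Mr", "Mr", "Miss", "Miss", "Master", "Master", "Mrs", "Mrs"])
ages = np.array([30.0, 40.0, 18.0, 24.0, 3.0, 5.0, 35.0, 45.0])


def fit_title_regression(titles, ages, baseline):
    """Regress age on 0/1 title indicators, leaving `baseline` as the reference."""
    levels = [t for t in np.unique(titles) if t != baseline]
    # Design matrix: a constant column plus one indicator column per title.
    X = np.column_stack(
        [np.ones(len(ages))] + [(titles == t).astype(float) for t in levels]
    )
    coef, *_ = np.linalg.lstsq(X, ages, rcond=None)
    fitted = X @ coef
    ss_res = np.sum((ages - fitted) ** 2)
    ss_tot = np.sum((ages - ages.mean()) ** 2)
    return levels, coef, 1.0 - ss_res / ss_tot


levels, coef, r_squared = fit_title_regression(titles, ages, baseline="Mrs")
```

With title indicators as the only predictors, the fitted value for each passenger is just the mean age for that title, so this model reproduces the averaging approach; any gain in accuracy has to come from adding other variables.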

If we include nearly all the other variables in the regression, R-squared increases to 43.7%.

Using backward elimination, a model with fewer variables had an R-squared of 43.3%.
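Backward elimination can be sketched roughly as follows. The removal rule used here (drop the predictor with the smallest absolute t-statistic until all remaining ones reach 2.0) is an assumption for illustration; statistical packages often use a p-value criterion instead, and the rule used for the model above is not stated. The data and column names are synthetic.

```python
import numpy as np


def backward_eliminate(X, y, names, t_min=2.0):
    """Drop the weakest predictor until every remaining |t| is at least t_min."""
    keep = list(range(X.shape[1]))
    while keep:
        # Refit OLS with a constant and the predictors still in the model.
        A = np.column_stack([np.ones(len(y)), X[:, keep]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        dof = len(y) - A.shape[1]
        sigma2 = resid @ resid / dof
        cov = sigma2 * np.linalg.inv(A.T @ A)
        # t-statistics for the predictors (the constant is never removed).
        t_abs = np.abs(coef[1:] / np.sqrt(np.diag(cov)[1:]))
        weakest = int(np.argmin(t_abs))
        if t_abs[weakest] >= t_min:
            break
        del keep[weakest]
    return [names[i] for i in keep]


# Synthetic example: two informative columns and one pure-noise column.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 30 + 5 * X[:, 0] - 4 * X[:, 1] + rng.normal(size=n)
selected = backward_eliminate(X, y, ["signal_a", "signal_b", "noise"])
```

The small drop in R-squared reported above (43.7% to 43.3%) is the usual pattern when the removed variables were contributing little.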

It will be interesting to see if and how this improves the predictions of a logistic regression model.

The final model was:

Predictors: (Constant), fare_per_person, cabin_G, cabin_F, Embarked_Q, Title_Other, Title_Miss, cabin_Y, Title_Master, Embarked_C, sibsp, male, fare, pclass, Title_Mr

I will do some more work on the residuals and other statistics. For example, the "all variables" model had a maximum Mahalanobis value of 712; under the simpler model produced by backward elimination, the Mahalanobis value for that particular case dropped to 6.7, and the maximum Mahalanobis value was 218.
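For reference, here is a minimal sketch of how a Mahalanobis value can be computed for each case from the predictor matrix. It assumes the squared distance from the centroid using the sample covariance, which may or may not be exactly the form reported above; the rows (age, fare) are made-up values and `mahalanobis_sq` is a hypothetical helper name.

```python
import numpy as np


def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row from the column means."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # pinv tolerates a singular covariance, e.g. from redundant indicators.
    inv_cov = np.linalg.pinv(cov)
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)


# Hypothetical rows of (age, fare); the last row has an extreme fare.
X = np.array([
    [22.0, 7.25],
    [38.0, 71.28],
    [26.0, 7.92],
    [35.0, 53.10],
    [35.0, 8.05],
    [30.0, 512.33],
])
d2 = mahalanobis_sq(X)
outlier_row = int(np.argmax(d2))
```

A case that is unusual on a variable which later drops out of the model no longer counts as extreme, which is one way a single case's value can fall sharply between the full and the reduced model.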
