Wednesday, April 3, 2013

Kaggle Titanic Competition - Submission 2

My second submission was a bit of an experiment. I had just learnt about logistic regression in class, and I wanted to try the technique out on the Titanic data set.

I had to use SPSS on this occasion; I couldn't get logistic regression in R to work for me. For some reason, the predict() function generated 891 predictions (the number of cases in the train data set), whereas I wanted it to generate 418 predictions - the number of cases in the test data set.

It was an interesting experiment (for a novice like myself).

Using the Binary Logistic function in SPSS, I set "survived" as the dependent variable and "pclass", "sex", "age" and "fare" as the covariates (independent variables).
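For readers who want to see the fit-then-score workflow outside SPSS, here is a minimal sketch in plain Python. It is not the SPSS procedure used here, and not R's predict() either. The toy rows, the column choice (is_female, pclass), the learning rate and the function names are all assumptions made up for illustration. The point is only that a model fitted on the training rows should return one prediction per row of the data it is asked to score.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(weights, x):
    # weights[0] is the intercept, the rest pair up with the covariates.
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit a logistic regression by plain batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = sigmoid(linear_score(weights, x)) - y
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def predict(weights, new_rows, threshold=0.5):
    """Return one 0/1 prediction per row of new_rows."""
    return [int(sigmoid(linear_score(weights, x)) >= threshold) for x in new_rows]

# Toy data, NOT the Titanic files: columns are (is_female, pclass).
train_x = [(1, 1), (1, 2), (0, 3), (0, 2), (1, 3), (0, 1)]
train_y = [1, 1, 0, 0, 1, 0]
test_x = [(1, 3), (0, 1), (0, 3)]

weights = fit_logistic(train_x, train_y)
test_pred = predict(weights, test_x)
print(len(train_x), len(test_x), len(test_pred))
```

The number of predictions follows the rows passed to predict(), not the rows used for fitting.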

The problem I encountered is that logistic regression (at least in SPSS) won't generate a predicted group membership value for a case with a missing value, and the test data set has 86 missing "age" values. This doesn't cause a problem (?) when generating the model, as the default in SPSS is (I think) case-wise deletion, so cases with missing values are simply left out of the fit.
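The two behaviours described above can be sketched in Python. This is only an illustration of the idea, not what SPSS does internally: the helper names, the stand-in model and the two toy rows are assumptions. Incomplete cases are dropped before fitting, and when scoring they get no prediction at all.

```python
COVARIATES = ("pclass", "sex", "age", "fare")

def is_complete(row):
    """True when none of the covariates is missing (None)."""
    return all(row.get(name) is not None for name in COVARIATES)

def complete_cases(rows):
    """Case-wise deletion: keep only rows with every covariate present."""
    return [row for row in rows if is_complete(row)]

def score(rows, model):
    """Score each row, leaving None where a covariate is missing."""
    return [model(row) if is_complete(row) else None for row in rows]

# Toy rows with made-up values, NOT taken from the Titanic files.
rows = [
    {"pclass": 1, "sex": "female", "age": 30.0, "fare": 80.0},
    {"pclass": 3, "sex": "male", "age": None, "fare": 8.0},
]

def stand_in_model(row):
    # Placeholder for a fitted model; any callable returning 0 or 1 would do.
    return 1 if row["sex"] == "female" else 0

usable = complete_cases(rows)
scored = score(rows, stand_in_model)
print(len(usable), scored)
```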

As this was an experiment, I decided to submit two entries:
- one with the missing predictions replaced by 0
- one with the missing predictions replaced by 1
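Building the two entries amounts to a simple fill-in step. Here is a minimal Python sketch, assuming the missing predictions are represented as None; the function name and the short example list are made up for illustration.

```python
def fill_missing(predictions, fill_value):
    """Replace every missing prediction (None) with fill_value."""
    return [fill_value if p is None else p for p in predictions]

# Made-up example: two of the four cases have no prediction.
raw = [1, None, 0, None]

entry_zeros = fill_missing(raw, 0)  # missing cases treated as "did not survive"
entry_ones = fill_missing(raw, 1)   # missing cases treated as "survived"

print(entry_zeros)
print(entry_ones)
```

The two entries differ only in the rows that had no prediction, so any difference in score comes from those passengers alone.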

The first submission generated the following message from Kaggle:

You improved on your best score by 0.13876.
You just moved up 857 positions on the leaderboard.

Using the above logistic regression and substituting "0" for the missing predictions scored 0.76555, up from the 0.62679 of my first submission.

Interestingly, this was the same score as the default gender-based model:

                    If the passenger is female, then she survives; if not, then he does not.
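That rule is simple enough to write out in full. A minimal Python sketch, assuming sex is recorded as the strings "female" and "male" and survival is coded as 1 or 0:

```python
def gender_model(sex):
    """Predict 1 (survives) for female passengers, 0 otherwise."""
    return 1 if sex.strip().lower() == "female" else 0

print([gender_model(s) for s in ["female", "male"]])
```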

I then replaced missing predictions with "1" - this scored poorly at 0.64115.

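My observations:

- There may be a relationship between missing age and survival: treating the passengers with a missing age as non-survivors scored much better (0.76555) than treating them as survivors (0.64115).
- Gender is clearly important in survival: the logistic regression entry scored the same as the gender-only model.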

My next steps are to:

- be able to generate predictions using logistic regression in R
- develop a more sophisticated logistic regression model

In particular, I'd like to see what sort of result I can get from a model that includes age, but defaults to a model that excludes age if age is missing.
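The idea can be sketched in Python as a small dispatcher that picks a model per passenger. Everything here is an assumption for illustration: the function names are hypothetical, and the two rules are made-up placeholders standing in for fitted models, not results from the Titanic data.

```python
def predict_with_fallback(passenger, full_model, no_age_model):
    """Use the model with age when age is known, else the model without age."""
    if passenger.get("age") is None:
        return no_age_model(passenger)
    return full_model(passenger)

# Placeholder rules standing in for two fitted models (NOT real fitted models).
def no_age_rule(p):
    return 1 if p["sex"] == "female" else 0

def with_age_rule(p):
    return 1 if p["sex"] == "female" or p["age"] < 10 else 0

# Made-up passengers, two of them with a missing age.
passengers = [
    {"sex": "male", "age": 8.0},
    {"sex": "male", "age": None},
    {"sex": "female", "age": None},
]

predictions = [
    predict_with_fallback(p, with_age_rule, no_age_rule) for p in passengers
]
print(predictions)
```

With this approach every passenger gets a prediction, so nothing has to be filled in with a blanket 0 or 1 afterwards.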

