Monday, April 15, 2013

Titanic Data Competition - Submission 4

Submission 4 is actually three submissions (a, b and c), and it's taught me that I need to go back to basics: all three scored worse than one of my first submissions, which was a simple model consisting of

- gender, pclass and age, with survival = 0 where age was missing.

Clearly not an original model!

These are the variables I tossed into today's submissions:

  1. pclass
  2. sex
  3. age missing
  4. combined age - which substituted an imputed age where age was missing.
  5. sibsp
  6. parch
  7. family - which is the total of sibsp and parch
  8. fare
  9. log fare
  10. adjusted cabin - which takes the first letter of the cabin, and substitutes X, Y, Z for missing values (where X, Y and Z represent 1st, 2nd and 3rd class respectively)
  11. cabin missing
  12. embarked
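For anyone following along outside SPSS, here is a minimal Python sketch of how the derived variables in the list above could be built for a single passenger. The dict keys, the `derive_features` name and the `imputed_age` argument are illustrative assumptions (the post doesn't say how the imputed age was calculated), and using `log1p` for log fare is also an assumption, chosen so that a fare of zero doesn't break the logarithm.

```python
import math

# Stand-in letters for a missing cabin, keyed by passenger class (item 10).
MISSING_CABIN_BY_CLASS = {1: "X", 2: "Y", 3: "Z"}


def derive_features(row, imputed_age):
    """Build the derived variables for one passenger.

    `row` is a dict with assumed keys: pclass, age, sibsp, parch, fare,
    cabin. Missing values are represented as None. `imputed_age` is
    whatever stand-in age you choose to use when age is missing.
    """
    age = row.get("age")
    cabin = row.get("cabin")
    fare = row.get("fare")
    return {
        "age_missing": int(age is None),
        "combined_age": imputed_age if age is None else age,
        "family": row["sibsp"] + row["parch"],
        # Assumption: log(1 + fare), so a zero fare maps to 0.0.
        "log_fare": None if fare is None else math.log1p(fare),
        "cabin_missing": int(not cabin),
        # First letter of the cabin, or X/Y/Z by class when it is missing.
        "adjusted_cabin": cabin[0] if cabin else MISSING_CABIN_BY_CLASS[row["pclass"]],
    }
```

Calling it with a third-class passenger who has no age and no cabin would set both missing flags to 1 and give an adjusted cabin of "Z".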

Submission 4b included the same variables. The difference was that with this submission I experimented with changing the classification cutoff, and ended up going with 0.59, which gave me the best classification score in SPSS.
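The cutoff search can be sketched in a few lines of Python. This is only an illustration of the idea, not what SPSS does internally: the function names are made up, and treating a probability exactly equal to the cutoff as a predicted survivor is an assumption.

```python
def accuracy_at_cutoff(probs, labels, cutoff):
    """Share of cases classified correctly when predicting 1 for prob >= cutoff."""
    hits = sum(int(p >= cutoff) == y for p, y in zip(probs, labels))
    return hits / len(labels)


def best_cutoff(probs, labels, candidates):
    """Return the candidate cutoff with the highest accuracy (first one wins ties)."""
    return max(candidates, key=lambda c: accuracy_at_cutoff(probs, labels, c))


# Made-up predicted probabilities and outcomes, purely for illustration.
toy_probs = [0.2, 0.55, 0.58, 0.7, 0.9]
toy_labels = [0, 0, 0, 1, 1]
chosen = best_cutoff(toy_probs, toy_labels, [0.5, 0.59])
```

Note that a cutoff picked this way is tuned to the same data used to score it, so it will tend to look better there than on new passengers.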

Submission 4c included the same variables as the first two, but in this case I treated combined age and fare as categorical variables.

According to the SPSS classification table, these models should have scored very well (version c correctly classified 92.7% of cases). The problem, no doubt, is that this figure is calculated on the same training data the model was fitted to, not on the test set. (I haven't split the training set into training and test portions.)
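A minimal sketch of the kind of split that would give a more honest score: hold back part of the training file, fit on the rest, and check the classification rate on the held-back rows only. The `holdout_split` name, the 30% fraction and the fixed seed are all assumptions picked for illustration.

```python
import random


def holdout_split(rows, holdout_fraction=0.3, seed=0):
    """Shuffle a copy of `rows` and split it into (fit_rows, holdout_rows)."""
    shuffled = list(rows)
    # A seeded generator makes the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

The model would then be fitted on the first part only, with the second part used purely for scoring.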


So the future strategy has to be to go back to the simple model referred to above, and build from there.

