
Tuesday, June 4, 2013

Where am I Up to With the Titanic Competition on Kaggle.



I’ve been participating in the “Getting Started” competition on kaggle.com that uses a data set about the Titanic disaster. It suits me well: it’s an ideal starting place for people who may not have a lot of experience in data science and machine learning.

Now, after 2 months and 51 submissions, I’ve progressed from a score of 0.62679 to 0.80861. I’m now in position 71, tied with 27 other competitors.

As this is a learning competition, here’s what I’ve learnt to score 0.80861:

Model

So far, I’ve stuck with binary logistic regression. It’s less fashionable than some other classification methods, but it works. I’d like to see if I can improve my score any further using logistic regression.
Initially I started using SPSS (because that’s what we use in class), but now I’ve moved to R. 

The model I submitted to get my best score is:

glm(formula = survived ~ male + pclass + fare + fare_per_person +
    Title + age_class.interaction + sex_class + combined_age +
    family + age_squared + age_class_squared, family = binomial(),
    data = train)

I’ll explain the R code:


  •   I’m using the glm() function in the R stats package

  •   the predictor variables in the model are

o   male – this is the “sex” variable in the data set from kaggle. I’ve just changed male/female to 1/0.
o   pclass – no change from the pclass variable in the kaggle data set.
o   fare – no change from the fare variable in the kaggle data set.
o   fare_per_person – I have calculated the number of people travelling together (sibsp + parch + 1) and divided the fare variable by that number.
o   Title – extracted the Title of each passenger from their name. I used Excel to do this and other data manipulation.
o   age_class.interaction – multiplied combined_age by pclass.
o   sex_class – multiplied sex (coded 1 or 2) by pclass.
o   combined_age – this is the age of the passenger, with missing values replaced by the median  age for each Title. So if the age was missing for Mr Smith, then I’ve used the median age for all passengers with Title “Mr”. Whenever I refer to age, this is the variable I’m referring to.
o   family – sibsp + parch
o   age_squared – combined_age squared
o   age_class_squared – age_class squared

  •    family – setting this parameter of the glm() function to “binomial” tells R you want a binary logistic regression model. The available options for this parameter include:

§  binomial(link = "logit")
§  gaussian(link = "identity")
§  Gamma(link = "inverse")
§  inverse.gaussian(link = "1/mu^2")
§  poisson(link = "log")
§  quasi(link = "identity", variance = "constant")
§  quasibinomial(link = "logit")
§  quasipoisson(link = "log")
 

  •  data – the data frame holding the training data, here named train.
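The variable definitions above could be sketched in R roughly as follows. This is only an illustration, not the workflow used for the submission (the data manipulation there was done in Excel): the six-row data frame is invented, the lower-case column names and the regular expression for Title are assumptions, and sex_class and age_class_squared are left out because their exact coding isn't spelled out above.

```r
# Toy stand-in for the Kaggle training data (all values invented for illustration).
train <- data.frame(
  survived = c(0, 1, 1, 0, 1, 0),
  pclass   = c(3, 1, 3, 3, 2, 1),
  name     = c("Braund, Mr. Owen Harris",
               "Cumings, Mrs. John Bradley",
               "Heikkinen, Miss. Laina",
               "Moran, Mr. James",
               "Nasser, Mrs. Nicholas",
               "McCarthy, Mr. Timothy J"),
  sex      = c("male", "female", "female", "male", "female", "male"),
  age      = c(22, 38, 26, NA, 14, 54),
  sibsp    = c(1, 1, 0, 0, 1, 0),
  parch    = c(0, 0, 0, 0, 0, 0),
  fare     = c(7.25, 71.28, 7.92, 8.46, 30.07, 51.86),
  stringsAsFactors = FALSE
)

# male: 1 for male, 0 for female
train$male <- as.integer(train$sex == "male")

# family and fare_per_person (party size is sibsp + parch + 1)
train$family <- train$sibsp + train$parch
train$fare_per_person <- train$fare / (train$family + 1)

# Title: the text between the comma and the first full stop of the name
# (assumed pattern; unusual names in the real data may need extra handling)
train$Title <- sub("^.*, *([^.]+)\\..*$", "\\1", train$name)

# combined_age: age, with missing values replaced by the median age of the same Title
title_median <- ave(train$age, train$Title,
                    FUN = function(x) median(x, na.rm = TRUE))
train$combined_age <- ifelse(is.na(train$age), title_median, train$age)

# Interaction and squared terms
train$age_class.interaction <- train$combined_age * train$pclass
train$age_squared <- train$combined_age^2

train[, c("Title", "age", "combined_age", "fare_per_person")]
```

With columns like these in place on the full training set, the glm() call shown earlier can be run as written; six rows are far too few to fit it here.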


What Hasn’t Worked (At Least So Far)

-          Using linear regression to predict age (where age is missing)

Using the median age of each Title for missing values gave me a significant score improvement, so I had the idea of using linear regression to hopefully predict age more accurately. This unfortunately didn’t work, and I’ve made some notes on this here and here and here.
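For readers who want to try the same idea, here is a minimal sketch of regression-based age imputation. The data are simulated, and the choice of predictors (pclass, sibsp and Title) is an assumption for illustration only; it is not necessarily the set of predictors used in the attempt described above.

```r
set.seed(1)
n <- 200

# Simulated passengers (invented data, not the Kaggle file)
passengers <- data.frame(
  pclass = sample(1:3, n, replace = TRUE),
  sibsp  = sample(0:3, n, replace = TRUE),
  Title  = factor(sample(c("Master", "Miss", "Mr", "Mrs"), n, replace = TRUE))
)
passengers$age <- pmax(1, round(45 - 5 * passengers$pclass -
                                  2 * passengers$sibsp + rnorm(n, sd = 8)))

# Blank out 40 ages to mimic missing values
missing <- sample(n, 40)
passengers$age[missing] <- NA

# lm() drops rows with a missing response by default, so the model is
# fitted only on passengers whose age is known
age_fit <- lm(age ~ pclass + sibsp + Title, data = passengers)

# Keep observed ages and fill the gaps with the model's predictions
passengers$age_lm <- passengers$age
passengers$age_lm[missing] <- predict(age_fit, newdata = passengers[missing, ])

summary(passengers$age_lm)
```

The imputed column can then be swapped in for the median-based age in the logistic regression to compare scores.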

-          trying different cut points

Logistic regression produces a probability that a case belongs to the category of the dependent variable coded 1 – here, the probability that a passenger survived. The default cut point is a probability of 0.5: cases at 0.5 and higher are predicted as one, and cases under 0.5 as zero.
I experimented with cut points above and below 0.5, but so far this hasn’t worked – probably because we’re trying to maximise total correct predictions (rather than correctly identify one category or the other).
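The cut point comparison can be sketched like this. The data are simulated and the two-predictor model is a stand-in chosen only to keep the example short, so the accuracies it prints say nothing about the real competition data.

```r
set.seed(42)
n <- 300

# Simulated data with a survival pattern loosely shaped like the Titanic's
toy <- data.frame(
  male   = rbinom(n, 1, 0.6),
  pclass = sample(1:3, n, replace = TRUE)
)
toy$survived <- rbinom(n, 1, plogis(2 - 2 * toy$male - 0.6 * toy$pclass))

fit <- glm(survived ~ male + pclass, family = binomial(), data = toy)

# type = "response" returns predicted probabilities rather than log-odds
p_hat <- predict(fit, type = "response")

# Classify at several cut points and record the share of correct predictions
cut_points <- c(0.3, 0.4, 0.5, 0.6, 0.7)
accuracy <- sapply(cut_points, function(cp) {
  predicted <- as.integer(p_hat >= cp)
  mean(predicted == toy$survived)
})

data.frame(cut_point = cut_points, accuracy = accuracy)
```

Accuracy measured on the training data is optimistic, so a held-out sample or cross-validation gives a fairer comparison between cut points.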

-          Including the following variables

§  age missing indicator (1 for age missing, 0 for age present)
§  including sibsp and parch separately, instead of combined “family”
§  logarithm of fare (log transformations are often used with financial variables)
§  adjusted cabin – taking the first letter of each cabin (A, B and so on, with pclass being used for missing values)
§  cabin missing indicator
§  embarked (variable in kaggle data set)
§  3 way interaction between age, pclass and sex
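Several of the variables in this list are simple to derive in R. The sketch below uses an invented four-row data frame. Two details are assumptions rather than anything stated above: missing cabins are treated as either NA or an empty string, and log(fare + 1) is used in place of a plain logarithm so that a fare of zero does not produce -Inf.

```r
# Invented rows for illustration
toy <- data.frame(
  age    = c(22, NA, 35, NA),
  fare   = c(7.25, 0, 53.10, 8.05),
  cabin  = c("", "C85", "C123", NA),
  pclass = c(3, 1, 1, 3),
  stringsAsFactors = FALSE
)

# Age missing indicator: 1 for age missing, 0 for age present
toy$age_missing <- as.integer(is.na(toy$age))

# Logarithm of fare; log1p(x) is log(1 + x), which stays finite at fare = 0
toy$log_fare <- log1p(toy$fare)

# Cabin missing indicator
cabin_blank <- is.na(toy$cabin) | toy$cabin == ""
toy$cabin_missing <- as.integer(cabin_blank)

# Adjusted cabin: first letter of the cabin, with pclass standing in when missing
toy$adjusted_cabin <- ifelse(cabin_blank,
                             as.character(toy$pclass),
                             substr(toy$cabin, 1, 1))

toy
```

In a glm() formula, a 3 way interaction such as the one in the last item can be written with the `*` operator, which also adds the lower-order terms.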



3 comments:

  1. Hi Graham,

    Thank you very much for the help and the link to the blog. I have a question. Is 'title' an important predictor? Is it making a significant difference?

  2. Hi Graham,
    Came across your blog recently and just wanted to congratulate you on some really interesting observations (in particular your work looking at title). I am also impressed how you have really drilled down into the data. We are actually sitting next to each other on the leaderboard for this competition - my own best model is an SVM. This is unfair as I used your title work! So I thought I would make a couple of suggestions to you that may or may not be useful.

    I highly recommend having a look at the caret package for R as it really streamlines the preprocessing and model building steps in R. It also contains built-in imputation arguments to replace missing data.

    Alternatively I have found the MICE package to be very easy to use as a way to impute missing data. Anyway, congratulations again on a great blog and thanks for your postings on the Forum.

    Stephen

  3. Hey, thanks for the insights in the Titanic competition! I am new to all of this and these are incredibly helpful for beginners!
