I’ve been participating in the “Getting Started” competition on kaggle.com that uses a data set about the Titanic disaster. It’s an ideal competition for me: it’s designed as a starting place for people who may not have a lot of experience in data science and machine learning.
After 2 months and 51 submissions, I’ve progressed from a score of 0.62679 to 0.80861, and I’m now in position 71, tied with 27 other competitors.
As this is a learning competition, here’s what I’ve learnt
to score 0.80861:
Model
So far, I’ve stuck with binary logistic regression. It’s a
competitor to more fashionable classification methods, but it works. I’d like
to see if I can improve my score any further using logistic regression.
I started out using SPSS (because that’s what we use in class), but I’ve since moved to R.
The model I submitted to get my best score is:
glm(formula = survived ~
male + pclass + fare + fare_per_person +
Title + age_class.interaction + sex_class +
combined_age +
family + age_squared + age_class_squared,
family = binomial(),
data = train)
I’ll explain the R code:
- I’m using the glm() function in the R stats package.
- The predictor variables in the model are:
  - male – this is the “sex” variable in the data set from kaggle; I’ve just changed male/female to 1/0.
  - pclass – no change from the pclass variable in the kaggle data set.
  - fare – no change from the fare variable in the kaggle data set.
  - fare_per_person – I calculated the number of people travelling together (sibsp + parch + 1) and divided the fare variable by that number.
  - Title – the title of each passenger, extracted from their name. I used Excel to do this and other data manipulation.
  - age_class.interaction – combined_age multiplied by pclass.
  - sex_class – sex (coded 1 or 2) multiplied by pclass.
  - combined_age – the age of the passenger, with missing values replaced by the median age for each Title. So if the age was missing for Mr Smith, I’ve used the median age of all passengers with Title “Mr”. Whenever I refer to age, this is the variable I’m referring to.
  - family – sibsp + parch.
  - age_squared – combined_age squared.
  - age_class_squared – age_class.interaction squared.
- family – setting this parameter of the glm() function to binomial() tells R you want a binary logistic regression model. Other options here include:
  - binomial(link = "logit")
  - gaussian(link = "identity")
  - Gamma(link = "inverse")
  - inverse.gaussian(link = "1/mu^2")
  - poisson(link = "log")
  - quasi(link = "identity", variance = "constant")
  - quasibinomial(link = "logit")
  - quasipoisson(link = "log")
- data – the training data frame, which is named train.
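For readers following along in R, here’s a minimal sketch of that feature engineering on a toy data frame. I actually did some of this work in Excel, so the code below is illustrative rather than my exact script, and the 1/2 coding used for sex_class is an assumption:

```r
# Toy stand-in for kaggle's train.csv (three made-up rows, kaggle-style columns)
train <- data.frame(
  name   = c("Braund, Mr. Owen Harris",
             "Heikkinen, Miss. Laina",
             "Bonnell, Miss. Elizabeth"),
  sex    = c("male", "female", "female"),
  age    = c(22, NA, 58),
  sibsp  = c(1, 0, 0),
  parch  = c(0, 0, 0),
  fare   = c(7.25, 7.925, 26.55),
  pclass = c(3, 3, 1),
  stringsAsFactors = FALSE
)

train$male   <- ifelse(train$sex == "male", 1, 0)   # male/female -> 1/0
train$family <- train$sibsp + train$parch
train$fare_per_person <- train$fare / (train$family + 1)

# Title: the token between the comma and the following period in the name
train$Title <- sub(".*, *([A-Za-z ]+)\\..*", "\\1", train$name)

# combined_age: age, with missing values replaced by the median age per Title
med <- ave(train$age, train$Title, FUN = function(x) median(x, na.rm = TRUE))
train$combined_age <- ifelse(is.na(train$age), med, train$age)

train$age_class.interaction <- train$combined_age * train$pclass
train$sex_class   <- (train$male + 1) * train$pclass  # assumes female = 1, male = 2
train$age_squared           <- train$combined_age^2
train$age_class_squared     <- train$age_class.interaction^2
```

With these columns in place, the glm() call above runs directly against the data frame.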
What Hasn’t Worked (At Least So Far)
- Using linear regression to predict age (where age is missing)
Using the median age of each Title
for missing values gave me a significant score improvement, so I had the idea
of using linear regression to hopefully predict age more accurately. This
unfortunately didn’t work, and I’ve made some notes on this here and here and here.
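For illustration, a linear-model imputation along these lines looks like the sketch below. The data is a toy and the predictors are illustrative, not my exact specification:

```r
# Toy data: some ages known, some missing
train <- data.frame(
  age    = c(22, NA, 58, 35, NA, 4),
  pclass = c(3, 3, 1, 1, 2, 3),
  sibsp  = c(1, 0, 0, 1, 0, 3),
  parch  = c(0, 0, 0, 0, 0, 1)
)

# Fit a linear model on the rows where age is known,
# then predict age for the rows where it is missing
known  <- !is.na(train$age)
age_lm <- lm(age ~ pclass + sibsp + parch, data = train[known, ])
train$age[!known] <- predict(age_lm, newdata = train[!known, ])
```

In my case the predicted ages didn’t improve the leaderboard score over the simpler per-Title medians.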
- Trying different cut points
Logistic regression produces a
probability that a case belongs in the reference category of the dependent
variable – here, the probability that a passenger survived. The default cut point is a probability of 0.5: 0.5 and higher is classed as one, and under 0.5 as zero.
I experimented with cutpoints above
and below 0.5, but so far this hasn’t worked – probably because we’re trying to
maximise total correct predictions (rather than correctly identify one category
or the other).
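A sketch of what that experiment looks like in R, using synthetic probabilities rather than the competition data:

```r
# Compare overall accuracy at several cut points (synthetic data)
set.seed(1)
prob   <- runif(200)            # stand-in for predict(model, type = "response")
actual <- rbinom(200, 1, prob)  # outcomes generated from those probabilities

cuts <- c(0.3, 0.4, 0.5, 0.6, 0.7)
accuracy <- sapply(cuts, function(cut) {
  mean(as.integer(prob >= cut) == actual)  # proportion of correct predictions
})
names(accuracy) <- cuts
print(round(accuracy, 3))
```

Since the metric is total correct predictions, accuracy tends to peak near the default cut of 0.5 rather than at an extreme value.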
- Including the following variables:
  - age missing indicator (1 for age missing, 0 for age present)
  - sibsp and parch separately, instead of the combined “family”
  - logarithm of fare (log transformations are often used with financial variables)
  - adjusted cabin – the first letter of each cabin (A, B and so on), with pclass used for missing values
  - cabin missing indicator
  - embarked (variable in the kaggle data set)
  - 3-way interaction between age, pclass and sex
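For completeness, here’s a toy end-to-end sketch of fitting a glm and writing a kaggle-format submission file. The data and the smaller formula are synthetic; only the PassengerId/Survived column names are dictated by kaggle:

```r
# Fit a binary logistic regression on synthetic training data
set.seed(2)
n <- 100
train <- data.frame(
  survived = rbinom(n, 1, 0.4),
  male     = rbinom(n, 1, 0.65),
  pclass   = sample(1:3, n, replace = TRUE),
  fare     = round(runif(n, 5, 100), 2)
)
model <- glm(survived ~ male + pclass + fare, family = binomial(), data = train)

# Synthetic "test set" with the same engineered columns
test <- data.frame(
  passengerid = 892:901,
  male        = rbinom(10, 1, 0.65),
  pclass      = sample(1:3, 10, replace = TRUE),
  fare        = round(runif(10, 5, 100), 2)
)

# Predicted survival probabilities, classified at the default 0.5 cut point
prob <- predict(model, newdata = test, type = "response")
submission <- data.frame(PassengerId = test$passengerid,
                         Survived    = as.integer(prob >= 0.5))
write.csv(submission, "titanic_submission.csv", row.names = FALSE)
```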
Hi Graham,
Thank you very much for the help and the link to the blog. I have a question: is 'title' an important predictor? Is it making a significant difference?
Hi Graham,
Came across your blog recently and just wanted to congratulate you on some really interesting observations (in particular your work looking at title). I am also impressed how you have really drilled down into the data. We are actually sitting next to each other on the leader board for this competition; my own best model is an SVM. This is unfair as I used your title work! So I thought I would make a couple of suggestions to you that may or may not be useful.
I highly recommend having a look at the caret package for R as it really streamlines the preprocessing and model building steps in R. It also contains built in imputation arguments to replace missing data.
Alternatively I have found the MICE package to be very easy to use as a way to impute missing data. Anyway, congratulations again on a great blog and thanks for your postings on the Forum.
Stephen
Hey, thanks for the insights in the Titanic competition! I am new to all of this and these are incredibly helpful for beginners!