I’ve been participating in the “Getting Started” competition on kaggle.com that uses a data set about the Titanic disaster. It’s an ideal competition for me: it’s aimed at people who don’t yet have a lot of experience in data science and machine learning.
Now, after 2 months and 51 submissions, I’ve progressed from a score of 0.62679 to 0.80861. I’m now in position 71, tied with 27 other competitors.
As this is a learning competition, here’s what I’ve learnt to score 0.80861:
So far, I’ve stuck with binary logistic regression. It’s less fashionable than some other classification methods, but it works. I’d like to see whether I can improve my score any further using logistic regression.
Initially I started using SPSS (because that’s what we use in class), but now I’ve moved to R.
The model I submitted to get my best score is:

    glm(formula = survived ~ male + pclass + fare + fare_per_person +
          Title + age_class.interaction + sex_class + combined_age +
          family + age_squared + age_class_squared,
        family = binomial(), data = train)
I’ll explain the R code:
- I’m using the glm() function in the R stats package
- the predictor variables in the model are:
  - male – the “sex” variable from the kaggle data set, with male/female recoded to 1/0.
  - pclass – no change from the pclass variable in the kaggle data set.
  - fare – no change from the fare variable in the kaggle data set.
  - fare_per_person – I calculated the number of people travelling together (sibsp + parch + 1) and divided the fare variable by that number.
  - Title – the title of each passenger (“Mr”, “Mrs” and so on), extracted from their name. I used Excel to do this and other data manipulation.
  - age_class.interaction – combined_age multiplied by pclass.
  - sex_class – sex (coded 1 or 2) multiplied by pclass.
  - combined_age – the age of the passenger, with missing values replaced by the median age for each Title. So if the age was missing for Mr Smith, I’ve used the median age of all passengers with Title “Mr”. Whenever I refer to age, this is the variable I mean.
  - family – sibsp + parch.
  - age_squared – combined_age squared.
  - age_class_squared – age_class.interaction squared.
- family – setting this parameter of the glm() function to binomial() tells R you want a binary logistic regression model. Other options here include:
  - binomial(link = "logit")
  - gaussian(link = "identity")
  - Gamma(link = "inverse")
  - inverse.gaussian(link = "1/mu^2")
  - poisson(link = "log")
  - quasi(link = "identity", variance = "constant")
  - quasibinomial(link = "logit")
  - quasipoisson(link = "log")
- data – the data frame holding the training data; mine is called train.
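To make the variable definitions above concrete, here’s how the feature engineering could look in R. I actually did most of it in Excel, so this is a sketch rather than my exact steps; the toy data frame below just stands in for kaggle’s train.csv (which uses the same lowercase column names), and the regex for titles is one possible approach.

```r
# toy stand-in for kaggle's train.csv (same lowercase column names)
train <- data.frame(
  name   = c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
             "Allen, Miss. Elisabeth Walton"),
  sex    = c("male", "female", "female"),
  age    = c(22, NA, 29),
  sibsp  = c(1, 0, 0),
  parch  = c(0, 0, 0),
  fare   = c(7.25, 7.925, 211.3375),
  pclass = c(3, 3, 1),
  stringsAsFactors = FALSE
)

train$male            <- ifelse(train$sex == "male", 1, 0)
train$family          <- train$sibsp + train$parch
train$fare_per_person <- train$fare / (train$sibsp + train$parch + 1)

# pull the title out of names like "Braund, Mr. Owen Harris"
train$Title <- sub(".*, *([A-Za-z ]+)\\..*", "\\1", train$name)

# replace missing ages with the median age for the passenger's Title
median_by_title    <- tapply(train$age, train$Title, median, na.rm = TRUE)
train$combined_age <- ifelse(is.na(train$age),
                             median_by_title[train$Title],
                             train$age)

# interaction and squared terms (sex coded 1/0 via "male" here;
# the 1/2 coding mentioned above would work the same way)
train$sex_class             <- train$male * train$pclass
train$age_class.interaction <- train$combined_age * train$pclass
train$age_squared           <- train$combined_age^2
train$age_class_squared     <- train$age_class.interaction^2
```

With columns like these in place, the glm() formula above can refer to them directly.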
What Hasn’t Worked (At Least So Far)
- Using linear regression to predict age (where age is missing)
Using the median age of each Title for missing values gave me a significant score improvement, so I had the idea of using linear regression to predict age more accurately. Unfortunately this didn’t work, and I’ve made some notes on this here and here and here.
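For what it’s worth, the regression-based imputation I tried looked roughly like this (a sketch, not my exact model – the predictors and the tiny data frame are illustrative):

```r
# toy data: one passenger with a missing age
train <- data.frame(
  age    = c(22, 38, 26, 35, NA),
  pclass = c(3, 1, 3, 1, 2),
  fare   = c(7.25, 71.28, 7.92, 53.10, 13.00)
)

# fit a linear regression on the passengers whose age is known ...
known     <- train[!is.na(train$age), ]
age_model <- lm(age ~ pclass + fare, data = known)

# ... and use it to fill in the missing ages
missing <- is.na(train$age)
train$age[missing] <- predict(age_model, newdata = train[missing, ])
```

On the real data, the ages this produced didn’t beat the simple per-Title medians on the leaderboard.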
- trying different cut points
Logistic regression produces a probability that a case belongs in the reference category of the dependent variable – here, the probability that a passenger survived. The default cut point is a probability of 0.5: predictions of 0.5 and higher are classed as one, and under 0.5 as zero.
I experimented with cut points above and below 0.5, but so far this hasn’t worked – probably because we’re trying to maximise total correct predictions (rather than correctly identify one category or the other).
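The cut point experiments amounted to changing one number in something like this (a sketch – the six-row data frame and one-variable model are trivial stand-ins for the full glm() fit above):

```r
# trivial stand-in for the fitted model (one predictor, six passengers)
train <- data.frame(survived = c(0, 1, 0, 1, 1, 0),
                    male     = c(1, 1, 1, 0, 0, 0))
model <- glm(survived ~ male, family = binomial(), data = train)

# predicted survival probabilities for new cases
test <- data.frame(male = c(1, 0))
p    <- predict(model, newdata = test, type = "response")

# the cut point turns probabilities into 0/1 predictions;
# 0.5 is the default, and nothing I tried beat it
cut_point <- 0.5
predicted <- ifelse(p >= cut_point, 1, 0)
```

With a single binary predictor like this, the fitted probabilities are just the survival rates in each group (1/3 for males, 2/3 for females in the toy data), which makes it easy to see what moving the cut point does.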
- Including the following variables:
  - age missing indicator (1 for age missing, 0 for age present)
  - sibsp and parch included separately, instead of the combined “family”
  - logarithm of fare (log transformations are often used with financial variables)
  - adjusted cabin – the first letter of each cabin (A, B and so on), with pclass used for missing values
  - cabin missing indicator
  - embarked (a variable in the kaggle data set)
  - the 3-way interaction between age, pclass and sex
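In R, those candidate variables could be built along these lines (a sketch with assumed details – for example I use log(fare + 1) because some fares are zero, which is my choice rather than something stated above):

```r
# toy rows standing in for the kaggle data
train <- data.frame(
  age    = c(38, NA),
  fare   = c(71.2833, 7.25),
  cabin  = c("C85", ""),
  pclass = c(1, 3),
  stringsAsFactors = FALSE
)

# age missing indicator
train$age_missing <- ifelse(is.na(train$age), 1, 0)

# log of fare (+1 guards against fares of zero -- my assumption)
train$log_fare <- log(train$fare + 1)

# adjusted cabin: first letter of the cabin, pclass where cabin is missing
cabin_known <- !is.na(train$cabin) & train$cabin != ""
train$adjusted_cabin <- ifelse(cabin_known,
                               substr(train$cabin, 1, 1),
                               as.character(train$pclass))

# cabin missing indicator
train$cabin_missing <- ifelse(cabin_known, 0, 1)
```

None of these earned their place in the final model, but they’re cheap to try.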