In this article, I'll outline how to use logistic regression in R to produce an entry in the Titanic machine learning competition.
I'll use R with the R Studio environment as there are a lot of advantages in using R Studio as opposed to the generic R interface. You can download R Studio here.
The first step is to download the training data set from kaggle.com. Then, in R Studio, import the file. The menu options to do this are : tools / import data set / from text file.
Now have a look at the file :
head(train)
survived pclass name sex age sibsp parch 1 0 3 Braund, Mr. Owen Harris male 22 1 0 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 3 1 3 Heikkinen, Miss. Laina female 26 0 0 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 5 0 3 Allen, Mr. William Henry male 35 0 0 6 0 3 Moran, Mr. James male NA 0 0 ticket fare cabin embarked 1 A/5 21171 7.2500 S 2 PC 17599 71.2833 C85 C 3 STON/O2. 3101282 7.9250 S 4 113803 53.1000 C123 S 5 373450 8.0500 S 6 330877 8.4583 Q
The head() function returns the first 6 rows of your file.
Then have a look at the structure of the file:
str(train)
'data.frame': 891 obs. of 11 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ sibsp : int 1 1 0 1 0 0 0 3 0 1 ...
$ parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The str() function shows, among other things, the data type of each variable. That can be important depending on what method you're using. In this instance, the "survived" variable is an integer, but some methods will need it coded as a factor (with two levels).
Another useful function is summary().
In this example, I'm going to produce a very simple binary logistic regression model, using only some of the variables in the original file downloaded from kaggle.com.
Once you have downloaded the data, you then need to work out what preparation / pre-processing is required:-
- missing data
- feature extraction
- logistic regression assumptions
.
Missing Data
.
The two variables in the Titanic data set with the most missing values are
- age
- cabin
.
Common methods for dealing with missing values include
- casewise or listwise deletion - not recommended with a small data set
- replacing missing values with the mean or median
- using some other method for estimating the missing value. For example, with this data set, one can use the mean/median age for each Title (Mr, Mrs, Miss, Master, etc) to obtain a replacement value. Similarly with Cabin, the pclass can be used to create a "dummy" cabin value.
.
Feature Extraction
This is where you create new variables from existing variables. For example, a new variable indicating whether a passenger traveled with or without family could be created from the sibsp and parch variables (If sibsp and parch both equal 0, then passenger traveled without family).
.
Another example of feature extraction is adding a polynomial term (eg, fare squared) to indicate a non-linear relationship, or transforming a variable (for example, taking the log of a value (particularly with financial variables)).
.
An interaction term can also be included - sex*pclass - were you suspect that survival experience differs by pclass and sex (eg, survival experience of 1st class female passengers differs from that of 1st class male passengers)
.
Finally, some levels of a variable may have too small numbers to be useful. For example, it might be worthwhile consolidating 3 and over into one group for both parch and sibsp.
.
Logistic Regression Assumptions
It's important to be aware of the assumptions underlying a particular method and / or it's implementation.
.
For example, some machine learning methods require centering and scaling (SVM)
Logistic regression is fairly flexible; however it is worthwhile reading something about each method you are using to understand its assumptions and requirements.
Exploratory Data Analysis
Before you start model building, it's important to look at the data.
With the Titanic data set, I started off by:-
- looking at the distribution (continuous data) or frequencies (discrete data) of each variable
- look at the bivariate relationship between each variable and "survived".
Charts are often the most effective way of exploring the data. For example, with continuous data, produce a histogram overlaid with a density plot.
The logistic regression model in R is in the glm() function, in the stats package. The stats package is part of the standard R installation - with most packages you need to install the package on your computer ( install.packages("package.name") ) and then load it (library(package.name) ).
The code to generate a logistic regression model is :
my.model <- glm(survived ~ pclass + sex, data = train, family = binomial())
You can see information about your model by using the summary command :
summary(my.model)
Call:
glm(formula = survived ~ pclass + sex, family = binomial(), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2030 -0.7036 -0.4519 0.6719 2.1599
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.2946 0.2974 11.077 <2e-16 ***
pclass -0.9606 0.1061 -9.057 <2e-16 ***
sexmale -2.6434 0.1838 -14.380 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.7 on 890 degrees of freedom
Residual deviance: 827.2 on 888 degrees of freedom
AIC: 833.2
Number of Fisher Scoring iterations: 4
Here you can see, for example, that both predictors, age and gender, are statistically significant ( with p < 0.001 ).
Now I've added the interaction between class and gender:
my.model <- glm(survived ~ pclass + sex + pclass:sex, data = train, family = binomial())
The summary() functions shows that the interaction term is also statistically significant.
summary(my.model)
Call:
glm(formula = survived ~ pclass + sex + pclass:sex, family = binomial(),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8488 -0.6969 -0.5199 0.4946 2.0339
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.0416 0.8273 7.303 2.81e-13 ***
pclass -2.0011 0.2950 -6.784 1.17e-11 ***
sexmale -6.0493 0.8756 -6.909 4.88e-12 ***
pclass:sexmale 1.3593 0.3202 4.245 2.19e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.66 on 890 degrees of freedom
Residual deviance: 803.12 on 887 degrees of freedom
AIC: 811.12
Number of Fisher Scoring iterations: 6
I'm now going to use this model to produce predictions using the test data set from kaggle.com.
So I download the test file and import it into R Studio.
Using the head() function, I can see that the test file is virtually identical in structure to the train file, with the exception of course that the test file does not show whether or not the passenger survived.
To produce a set of predictions, you use the predict() function:
titanic.predict <- predict(my.model, newdata = test, type = "response")
Using head(), we can see the first 6 predictions:
head(titanic.predict)
1 2 3 4 5 6
0.1263887 0.5095497 0.2156133 0.1263887 0.5095497 0.1263887
We then need to convert the probabilities into categories - I'm taking the easy way out here, and doing the next few steps in excel:
- set the working directory for R to the desktop - so I can save the file to the desktop- there's a menu option in R Studio to set the working directory - look under "session"
- then save the file : write.table(titanic.predict, file = "titanic_predict.csv", sep = ",")
- open the file in excel
- convert the probabilities to categories (0 or 1) - use the if function
- paste the categories as the first column of the test file
- upload to kaggle and see what score you get.
This simple model scores 0.76555.
From here, you can try different models. Part of the expertise in using logistic regression is knowing what variables to include or exclude, what variables to transform, and what new variables to create.
If anything is not clear, post your question in the comments section and I'll endeavor to answer.
Great post Graham!
ReplyDeleteHi Emma,
ReplyDeleteThanks for this post ,its really helpful and easy to follow.
could you also explain the R code for missing age values based on the titles?
One of the contestants in the kaggle competition made the suggestion that extracting the titles from the passenger names would be a good way to impute values for those passengers whose age was missing.
ReplyDeleteI used excel to extract the titles - I've posted a screenshot to show the formulas I used ( http://graham-twitterlinks.blogspot.com.au/2013/07/screenshot-of-excel.html ).
The idea is that, for example, the title "Master" indicates with some degree of accuracy the passenger's age. Someone with title Master is most likely under 18. Same thing applies to a lesser degree for "Miss". A woman with the title "Mrs" is probably at least 18.
So it's simply a case of calculating the mean or median age for each title group - where age is available - and using the value to represent the age for that title group where age is missing.
I left the original age variable as is, and created a new variable containing either the age as given or the age calculated as above.
So there is no R code I can show you - all the data munging was done in excel, and then I used R to do the logistic regression from that point on.
Hope that answers your question.