Pages

Wednesday, June 19, 2013

Using Logistic Regression in R

The "getting started" Titanic machine learning competition on kaggle.com  is a good opportunity to learn how to use R and logistic regression.

In this article, I'll outline how to use logistic regression in R to produce an entry in the Titanic machine learning competition.

I'll use R with the R Studio environment as there are a lot of advantages in using R Studio as opposed to the generic R interface. You can download R Studio here.

The first step is to download the training data set from kaggle.com. Then, in R Studio, import the file. The menu options to do this are :  tools / import data set / from text file.

Now have a look at the file :


 head(train)
 
   survived pclass                                                name    sex age sibsp parch
1        0      3                             Braund, Mr. Owen Harris   male  22     1     0
2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3        1      3                              Heikkinen, Miss. Laina female  26     0     0
4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
5        0      3                            Allen, Mr. William Henry   male  35     0     0
6        0      3                                    Moran, Mr. James   male  NA     0     0
            ticket    fare cabin embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
4           113803 53.1000  C123        S
5           373450  8.0500              S
6           330877  8.4583              Q
 
 The head() function returns the first 6 rows of your file.
 
Then have a look at the structure of the file:
 
str(train)
 
 'data.frame': 891 obs. of  11 variables:
 $ survived: int  0 1 1 1 0 0 0 0 1 1 ...
 $ pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
 $ name    : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ age     : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ sibsp   : int  1 1 0 1 0 0 0 3 0 1 ...
 $ parch   : int  0 0 0 0 0 0 0 1 2 0 ...
 $ ticket  : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ fare    : num  7.25 71.28 7.92 53.1 8.05 ...
 $ cabin   : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
 
The str() function shows, among other things, the data type of each variable. That can be important depending on what method you're using. In this instance, the "survived" variable is an integer, but some methods will need it coded as a factor (with two levels).

Another useful function is summary().
 


In this example, I'm going to produce a very simple binary logistic regression model, using only some of the variables in the original file downloaded from kaggle.com.
 
Once you have downloaded the data, you then need to work out what preparation / pre-processing is required:-

- missing data
- feature extraction
- logistic regression assumptions 
.


Missing Data
.


The two variables in the Titanic data set with the most missing values are
- age
- cabin
.


Common methods for dealing with missing values include

- casewise or listwise deletion - not recommended with a small data set
- replacing missing values with the mean or median
- using some other method for estimating the missing value. For example, with this data set, one can use the mean/median age for each Title (Mr, Mrs, Miss, Master, etc) to obtain a replacement value. Similarly with Cabin, the pclass can be used to create a "dummy" cabin value.
.


Feature Extraction

This is where you create new variables from existing variables. For example, a new variable indicating whether a passenger traveled with or without family could be created from the sibsp and parch variables (If sibsp and parch both equal 0, then passenger traveled without family).
.


Another example of feature extraction is adding a polynomial term (eg, fare squared) to indicate a non-linear relationship, or transforming a variable (for example, taking the log of a value (particularly with financial variables)).
.

An interaction term can also be included -  sex*pclass - were you suspect that survival experience differs by pclass and sex (eg, survival experience of 1st class female passengers differs from that of 1st class male passengers)
.


Finally, some levels of a variable may have too small numbers to be useful. For example, it might be worthwhile consolidating 3 and over into one group for both parch and sibsp.
.


Logistic Regression Assumptions


It's important to be aware of the assumptions underlying a particular method and / or it's implementation.
.

For example, some machine learning methods require centering and scaling (SVM)

Logistic regression is fairly flexible; however it is worthwhile reading something about each method you are using to understand its assumptions and requirements.


Exploratory Data Analysis

Before you start model building, it's important to look at the data.

With the Titanic data set, I started off by:-

- looking at the distribution (continuous data) or frequencies (discrete data) of each variable
- look at the bivariate relationship between each variable and "survived".

Charts are often the most effective way of exploring the data. For example, with continuous data, produce a histogram overlaid with a density plot.



The logistic regression model in R is in the glm() function, in the stats package. The stats package is part of the standard R installation - with most packages you need to install the package on your computer ( install.packages("package.name") ) and then load it (library(package.name) ).

The code to generate a logistic regression model is :

my.model <- glm(survived ~ pclass + sex, data = train, family = binomial())
 
You can see information about your model by using the summary command :
 
summary(my.model)

Call:
glm(formula = survived ~ pclass + sex, family = binomial(), data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2030  -0.7036  -0.4519   0.6719   2.1599  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.2946     0.2974  11.077   <2e-16 ***
pclass       -0.9606     0.1061  -9.057   <2e-16 ***
sexmale      -2.6434     0.1838 -14.380   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.7  on 890  degrees of freedom
Residual deviance:  827.2  on 888  degrees of freedom
AIC: 833.2

Number of Fisher Scoring iterations: 4
 
 Here you can see, for example, that both predictors, age and gender, are statistically significant ( with p < 0.001 ).
 
Now I've added the interaction between class and gender:
 
my.model <- glm(survived ~ pclass + sex + pclass:sex, data = train, family = binomial())
 
The summary() functions shows that the interaction term is also statistically significant.
 
 summary(my.model)

Call:
glm(formula = survived ~ pclass + sex + pclass:sex, family = binomial(), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8488  -0.6969  -0.5199   0.4946   2.0339  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)      6.0416     0.8273   7.303 2.81e-13 ***
pclass          -2.0011     0.2950  -6.784 1.17e-11 ***
sexmale         -6.0493     0.8756  -6.909 4.88e-12 ***
pclass:sexmale   1.3593     0.3202   4.245 2.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  803.12  on 887  degrees of freedom
AIC: 811.12

Number of Fisher Scoring iterations: 6
I'm now going to use this model to produce predictions using the test data set from kaggle.com.
 
So I download the test file and import it into R Studio.
 
Using the head() function, I can see that the test file is virtually identical in structure to the train file, with the exception of course that the test file does not show whether or not the passenger survived.

To produce a set of predictions, you use the predict() function:

titanic.predict <- predict(my.model, newdata = test, type = "response")
 
Using head(), we can see the first 6 predictions:
 
head(titanic.predict)
        1         2         3         4         5         6 
0.1263887 0.5095497 0.2156133 0.1263887 0.5095497 0.1263887 
 
We then need to convert the probabilities into categories - I'm taking the easy way out here, and doing the next few steps in excel:
 
- set the working directory for R to the desktop - so I can save the file to the desktop

- there's a menu option in R Studio to set the working directory - look under "session"

- then save the file :   write.table(titanic.predict, file = "titanic_predict.csv", sep = ",")

-  open the file in excel

- convert the probabilities to categories (0 or 1) - use the if function

- paste the categories as the first column of the test file

- upload to kaggle and see what score you get.

This simple model scores 0.76555.


From here, you can try different models. Part of the expertise in using logistic regression is knowing what variables to include or exclude, what variables to transform, and what new variables to create.


If anything is not clear, post your question in the comments section and I'll endeavor to answer.








     
 
 
    



 

 
 
 

3 comments:

  1. Hi Emma,
    Thanks for this post ,its really helpful and easy to follow.
    could you also explain the R code for missing age values based on the titles?

    ReplyDelete
  2. One of the contestants in the kaggle competition made the suggestion that extracting the titles from the passenger names would be a good way to impute values for those passengers whose age was missing.

    I used excel to extract the titles - I've posted a screenshot to show the formulas I used ( http://graham-twitterlinks.blogspot.com.au/2013/07/screenshot-of-excel.html ).

    The idea is that, for example, the title "Master" indicates with some degree of accuracy the passenger's age. Someone with title Master is most likely under 18. Same thing applies to a lesser degree for "Miss". A woman with the title "Mrs" is probably at least 18.

    So it's simply a case of calculating the mean or median age for each title group - where age is available - and using the value to represent the age for that title group where age is missing.

    I left the original age variable as is, and created a new variable containing either the age as given or the age calculated as above.

    So there is no R code I can show you - all the data munging was done in excel, and then I used R to do the logistic regression from that point on.

    Hope that answers your question.

    ReplyDelete