In this article, I'll outline how to use logistic regression in R to produce an entry in the Titanic machine learning competition.
I'll use R with the RStudio environment, which offers many advantages over the generic R interface. You can download RStudio from the RStudio website.
The first step is to download the training data set from kaggle.com. Then, in RStudio, import the file. The menu options to do this are: Tools / Import Dataset / From Text File.
Now have a look at the file:
head(train)
  survived pclass                                                name    sex age sibsp parch
1        0      3                             Braund, Mr. Owen Harris   male  22     1     0
2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3        1      3                              Heikkinen, Miss. Laina female  26     0     0
4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
5        0      3                            Allen, Mr. William Henry   male  35     0     0
6        0      3                                    Moran, Mr. James   male  NA     0     0
            ticket    fare cabin embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
4           113803 53.1000  C123        S
5           373450  8.0500              S
6           330877  8.4583              Q
The head() function returns the first 6 rows of your file.
Then have a look at the structure of the file:
str(train)
'data.frame': 891 obs. of 11 variables:
$ survived: int 0 1 1 1 0 0 0 0 1 1 ...
$ pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ sibsp : int 1 1 0 1 0 0 0 3 0 1 ...
$ parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
The str() function shows, among other things, the data type of each variable. That can be important depending on what method you're using. In this instance, the "survived" variable is an integer, but some methods will need it coded as a factor (with two levels).
Another useful function is summary().
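For example, run on the whole data frame it gives per-variable summaries (quartiles for numeric variables, counts per level for factors, and NA counts); run on a single column it summarises just that variable:

```r
# Summary statistics for every column in the training data
summary(train)

# Summary for just the age variable (also reports the number of NAs)
summary(train$age)
```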
In this example, I'm going to produce a very simple binary logistic regression model, using only some of the variables in the original file downloaded from kaggle.com.
Once you have downloaded the data, you then need to work out what preparation / pre-processing is required:
- missing data
- feature extraction
- logistic regression assumptions
Missing Data
The two variables in the Titanic data set with the most missing values are:
- age
- cabin
Common methods for dealing with missing values include:
- casewise or listwise deletion - not recommended with a small data set
- replacing missing values with the mean or median
- using some other method to estimate the missing value. For example, with this data set, the mean/median age for each title (Mr, Mrs, Miss, Master, etc.) can be used as a replacement value. Similarly with cabin, the pclass can be used to create a "dummy" cabin value.
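As a sketch of the third approach, the title can be pulled out of the name column and used to impute age. The regular expression is an assumption about the name format ("Surname, Title. Firstname"); adjust it if your download differs:

```r
# Extract the title (Mr, Mrs, Miss, Master, ...) from the name column.
# The pattern assumes names look like "Surname, Title. Firstname".
train$title <- sub("^.*, *([A-Za-z]+)\\..*$", "\\1", train$name)

# Median age for each title, ignoring missing values
median.age <- tapply(train$age, train$title, median, na.rm = TRUE)

# Replace each missing age with the median age for that passenger's title
missing <- is.na(train$age)
train$age[missing] <- median.age[train$title[missing]]
```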
Feature Extraction
This is where you create new variables from existing variables. For example, a new variable indicating whether a passenger traveled with or without family could be created from the sibsp and parch variables (if sibsp and parch both equal 0, the passenger traveled without family).
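A minimal sketch of that family indicator, using the column names shown by str(train) above:

```r
# 1 = travelled alone (no siblings/spouses and no parents/children aboard)
train$alone <- ifelse(train$sibsp == 0 & train$parch == 0, 1, 0)

# Check the frequencies of the new variable
table(train$alone)
```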
Another example of feature extraction is adding a polynomial term (e.g., fare squared) to capture a non-linear relationship, or transforming a variable - for example, taking the log of a value (particularly common with financial variables).
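A quick sketch of both transformations on the fare variable (fare contains zeros, so 1 is added before taking the log):

```r
# Polynomial term: fare squared
train$fare.sq  <- train$fare^2

# Log transformation; add 1 because some fares are zero
train$log.fare <- log(train$fare + 1)
```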
An interaction term - sex*pclass - can also be included where you suspect that survival experience differs jointly by pclass and sex (e.g., the survival experience of 1st class female passengers differs from that of 1st class male passengers).
Finally, some levels of a variable may have too few observations to be useful. For example, it might be worthwhile consolidating values of 3 and over into one group for both parch and sibsp.
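One way to sketch that consolidation is with pmin(), which caps the values so that 3 and over form a single group:

```r
# Cap sibsp and parch at 3, so 3+ becomes one group
train$sibsp.grp <- pmin(train$sibsp, 3)
train$parch.grp <- pmin(train$parch, 3)

# Check the new frequencies
table(train$sibsp.grp)
table(train$parch.grp)
```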
Logistic Regression Assumptions
It's important to be aware of the assumptions underlying a particular method and/or its implementation.
For example, some machine learning methods, such as SVM, require centering and scaling of the predictors.
Logistic regression is fairly flexible; however, it is worthwhile reading about each method you are using to understand its assumptions and requirements.
Exploratory Data Analysis
Before you start model building, it's important to look at the data.
With the Titanic data set, I started off by:
- looking at the distribution (continuous data) or frequencies (discrete data) of each variable
- looking at the bivariate relationship between each variable and "survived".
Charts are often the most effective way of exploring the data. For example, with continuous data, produce a histogram overlaid with a density plot.
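As a sketch with base graphics, using the age variable (missing ages are dropped first):

```r
# Histogram of age on the density scale, overlaid with a density estimate
age <- train$age[!is.na(train$age)]
hist(age, freq = FALSE, breaks = 20,
     main = "Distribution of age", xlab = "Age")
lines(density(age), lwd = 2)

# A bivariate look: cross-tabulation of sex against survival
table(train$sex, train$survived)
```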
The logistic regression model in R is in the glm() function, in the stats package. The stats package is part of the standard R installation - with most packages you need to install the package on your computer ( install.packages("package.name") ) and then load it (library(package.name) ).
The code to generate a logistic regression model is:
my.model <- glm(survived ~ pclass + sex, data = train, family = binomial())
You can see information about your model by using the summary() function:
summary(my.model)
Call:
glm(formula = survived ~ pclass + sex, family = binomial(), data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2030 -0.7036 -0.4519 0.6719 2.1599
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.2946 0.2974 11.077 <2e-16 ***
pclass -0.9606 0.1061 -9.057 <2e-16 ***
sexmale -2.6434 0.1838 -14.380 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.7 on 890 degrees of freedom
Residual deviance: 827.2 on 888 degrees of freedom
AIC: 833.2
Number of Fisher Scoring iterations: 4
Here you can see, for example, that both predictors, passenger class and sex, are statistically significant (with p < 0.001).
Now I've added the interaction between class and gender:
my.model <- glm(survived ~ pclass + sex + pclass:sex, data = train, family = binomial())
The summary() function shows that the interaction term is also statistically significant.
summary(my.model)
Call:
glm(formula = survived ~ pclass + sex + pclass:sex, family = binomial(),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8488 -0.6969 -0.5199 0.4946 2.0339
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.0416 0.8273 7.303 2.81e-13 ***
pclass -2.0011 0.2950 -6.784 1.17e-11 ***
sexmale -6.0493 0.8756 -6.909 4.88e-12 ***
pclass:sexmale 1.3593 0.3202 4.245 2.19e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.66 on 890 degrees of freedom
Residual deviance: 803.12 on 887 degrees of freedom
AIC: 811.12
Number of Fisher Scoring iterations: 6
I'm now going to use this model to produce predictions using the test data set from kaggle.com.
So I download the test file and import it into R Studio.
Using the head() function, I can see that the test file is virtually identical in structure to the train file, except of course that it does not show whether or not the passenger survived.
To produce a set of predictions, you use the predict() function:
titanic.predict <- predict(my.model, newdata = test, type = "response")
Using head(), we can see the first 6 predictions:
head(titanic.predict)
1 2 3 4 5 6
0.1263887 0.5095497 0.2156133 0.1263887 0.5095497 0.1263887
We then need to convert the probabilities into categories. I'm taking the easy way out here and doing the next few steps in Excel:
- set the working directory for R to the desktop, so the file can be saved there (there's a menu option in RStudio to set the working directory - look under "Session")
- save the file: write.table(titanic.predict, file = "titanic_predict.csv", sep = ",")
- open the file in Excel
- convert the probabilities to categories (0 or 1) using the IF function
- paste the categories as the first column of the test file
- upload to Kaggle and see what score you get.
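The Excel steps can also be done entirely in R. A sketch, assuming a 0.5 cutoff and that the test file has a passengerid column - both are assumptions, so adjust to Kaggle's current submission format:

```r
# Convert predicted probabilities to 0/1 categories using a 0.5 cutoff
titanic.class <- ifelse(titanic.predict > 0.5, 1, 0)

# Build a submission data frame; "passengerid" is an assumed column name -
# check the columns of your test file
submission <- data.frame(PassengerId = test$passengerid,
                         Survived    = titanic.class)

# Write the submission file without row names
write.csv(submission, file = "titanic_predict.csv", row.names = FALSE)
```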
This simple model scores 0.76555.
From here, you can try different models. Part of the expertise in using logistic regression is knowing what variables to include or exclude, what variables to transform, and what new variables to create.
If anything is not clear, post your question in the comments section and I'll endeavor to answer.