In this article, I'll outline how to use logistic regression in R to produce an entry in the Titanic machine learning competition.
I'll use R with the RStudio environment, as RStudio has many advantages over the basic R interface. You can download RStudio here.
The first step is to download the training data set from kaggle.com. Then, in RStudio, import the file. The menu options to do this are: Tools / Import Dataset / From Text File.
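The same import can be done from the console with read.csv(). This is a minimal sketch: a tiny temporary file stands in for the Kaggle download, and the names csv_path and titanic are my own assumptions. With the real data, point csv_path at wherever you saved the training file.

```r
# Stand-in for the Kaggle download: three rows written to a temporary CSV.
# With the real data, set csv_path to the location of the downloaded file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked",
  "0,3,\"Braund, Mr. Owen Harris\",male,22,1,0,A/5 21171,7.25,,S",
  "1,1,\"Cumings, Mrs. John Bradley (Florence Briggs Thayer)\",female,38,1,0,PC 17599,71.2833,C85,C",
  "0,3,\"Moran, Mr. James\",male,,0,0,330877,8.4583,,Q"
), csv_path)

# Read the file into a data frame; text columns stay as character vectors.
# Blank fields in numeric columns (such as age) are read as NA.
titanic <- read.csv(csv_path, stringsAsFactors = FALSE)
```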
Now have a look at the file :
    survived  pclass  name                                                 sex     age  sibsp  parch
1   0         3       Braund, Mr. Owen Harris                              male    22   1      0
2   1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38   1      0
3   1         3       Heikkinen, Miss. Laina                               female  26   0      0
4   1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35   1      0
5   0         3       Allen, Mr. William Henry                             male    35   0      0
6   0         3       Moran, Mr. James                                     male    NA   0      0

    ticket            fare     cabin  embarked
1   A/5 21171         7.2500          S
2   PC 17599          71.2833  C85    C
3   STON/O2. 3101282  7.9250          S
4   113803            53.1000  C123   S
5   373450            8.0500          S
6   330877            8.4583          Q
The head() function returns the first 6 rows of your file.
Then have a look at the structure of the file:
The str() function shows, among other things, the data type of each variable. That can be important depending on what method you're using. In this instance, the "survived" variable is an integer, but some methods will need it coded as a factor (with two levels).
Another useful function is summary().
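The three inspection functions can be tried on a small stand-in data frame. In this sketch the values are a subset of the columns from the six rows shown above; the data frame name titanic and the new column survived_f are assumptions for illustration.

```r
# A few columns from the six rows shown above
titanic <- data.frame(
  survived = c(0L, 1L, 1L, 1L, 0L, 0L),
  pclass   = c(3L, 1L, 3L, 1L, 3L, 3L),
  sex      = c("male", "female", "female", "female", "male", "male"),
  age      = c(22, 38, 26, 35, 35, NA),
  fare     = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583),
  stringsAsFactors = FALSE
)

head(titanic)     # first rows of the data frame
str(titanic)      # data type of each variable
summary(titanic)  # summary statistics, including a count of NAs per column

# Some methods need the outcome as a two-level factor rather than an integer
titanic$survived_f <- factor(titanic$survived, levels = c(0, 1))
```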
In this example, I'm going to produce a very simple binary logistic regression model, using only some of the variables in the original file downloaded from kaggle.com.
Once you have downloaded the data, you then need to work out what preparation / pre-processing is required:-
- missing data
- feature extraction
- logistic regression assumptions
Missing Data
The two variables in the Titanic data set with the most missing values are age and cabin.
Common methods for dealing with missing values include:
- casewise or listwise deletion - not recommended with a small data set
- replacing missing values with the mean or median
- using some other method to estimate the missing value. For example, with this data set, you can use the mean or median age for each title (Mr, Mrs, Miss, Master, etc.) as the replacement value. Similarly, for cabin, pclass can be used to create a "dummy" cabin value.
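The two replacement strategies for age can be sketched as follows. This is only an illustration on the six names and ages shown earlier; the regular expression used to extract the title and the column names age_overall, title and age_by_title are my own assumptions.

```r
titanic <- data.frame(
  name = c("Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
           "Heikkinen, Miss. Laina",
           "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
           "Allen, Mr. William Henry",
           "Moran, Mr. James"),
  age = c(22, 38, 26, 35, 35, NA),
  stringsAsFactors = FALSE
)

# Option 1: replace missing ages with the overall median
overall_median <- median(titanic$age, na.rm = TRUE)
titanic$age_overall <- ifelse(is.na(titanic$age), overall_median, titanic$age)

# Option 2: replace missing ages with the median age for the passenger's title.
# The title is the text between the comma and the first full stop in the name.
titanic$title <- sub("^[^,]*, *([^.]+)\\..*$", "\\1", titanic$name)
title_median <- ave(titanic$age, titanic$title,
                    FUN = function(x) median(x, na.rm = TRUE))
titanic$age_by_title <- ifelse(is.na(titanic$age), title_median, titanic$age)
```

If every age within a title is missing, the group median is itself NA, so a fallback to the overall median may still be needed.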
Feature Extraction
Feature extraction is where you create new variables from existing variables. For example, a new variable indicating whether a passenger traveled with or without family could be created from the sibsp and parch variables (if sibsp and parch both equal 0, then the passenger traveled without family).
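A minimal sketch of the family indicator, using the sibsp and parch values from the six rows shown earlier (the column name with_family is an assumption):

```r
titanic <- data.frame(
  sibsp = c(1, 1, 0, 1, 0, 0),
  parch = c(0, 0, 0, 0, 0, 0)
)

# 1 if the passenger had any sibling, spouse, parent or child aboard, else 0
titanic$with_family <- as.integer(titanic$sibsp > 0 | titanic$parch > 0)
```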
Another example of feature extraction is adding a polynomial term (e.g., fare squared) to capture a non-linear relationship, or transforming a variable, for example by taking its log (particularly useful with financial variables).
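Both transformations are one-liners in R. In this sketch the column names fare_sq and log_fare are assumptions, and log1p() (the log of 1 + fare) is my choice so that the transformation is still defined if a fare is zero.

```r
titanic <- data.frame(fare = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583))

# Polynomial term: fare squared
titanic$fare_sq <- titanic$fare^2

# Log transformation; log1p(x) computes log(1 + x)
titanic$log_fare <- log1p(titanic$fare)

# A squared term can also be written directly in a model formula,
# for example: survived ~ fare + I(fare^2)
```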
An interaction term, such as sex*pclass, can also be included where you suspect that the effect of one variable depends on another (e.g., the difference in survival between female and male passengers is not the same in 1st class as in 3rd class).
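In a glm() formula, sex * pclass expands to both main effects plus their interaction. The sketch below uses simulated data (not the Titanic file) purely to show which coefficients are produced; the simulation settings are arbitrary assumptions.

```r
set.seed(42)

# Simulated stand-in data: 30 passengers for each sex / pclass combination
d <- expand.grid(sex = c("female", "male"), pclass = factor(1:3))
d <- d[rep(seq_len(nrow(d)), each = 30), ]
p <- plogis(1.5 - 2 * (d$sex == "male") - 0.6 * (as.integer(d$pclass) - 1))
d$survived <- rbinom(nrow(d), size = 1, prob = p)

# sex * pclass is shorthand for sex + pclass + sex:pclass
fit <- glm(survived ~ sex * pclass, data = d, family = binomial)
names(coef(fit))
```

The coefficient names containing a colon (sexmale:pclass2 and sexmale:pclass3) are the interaction terms.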
Finally, some levels of a variable may have too few observations to be useful. For example, it might be worthwhile consolidating values of 3 and over into one group for both parch and sibsp.
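One way to consolidate the higher values is to cap them at 3 and then label the top group. This is a sketch with made-up sibsp values; the column name sibsp_grp and the "3+" label are assumptions.

```r
# Illustrative values only
titanic <- data.frame(sibsp = c(0, 1, 2, 3, 5, 8))

# pmin() caps every value at 3, so 3, 5 and 8 all fall into the same group
titanic$sibsp_grp <- factor(pmin(titanic$sibsp, 3),
                            levels = 0:3,
                            labels = c("0", "1", "2", "3+"))
table(titanic$sibsp_grp)
```

The same approach works for parch.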
Logistic Regression Assumptions
It's important to be aware of the assumptions underlying a particular method and/or its implementation.
For example, some machine learning methods, such as support vector machines (SVM), require the predictors to be centered and scaled.
Logistic regression is fairly flexible; however, it is worthwhile reading something about each method you are using to understand its assumptions and requirements.
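For methods that do need it, centering and scaling can be done with base R's scale() function. A minimal sketch on the ages and fares of the first five passengers shown earlier (the object names x and x_scaled are assumptions):

```r
x <- data.frame(
  age  = c(22, 38, 26, 35, 35),
  fare = c(7.25, 71.2833, 7.925, 53.1, 8.05)
)

# scale() subtracts each column's mean and divides by its standard deviation
x_scaled <- scale(x)

colMeans(x_scaled)       # approximately 0 for every column
apply(x_scaled, 2, sd)   # 1 for every column
```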
Exploratory Data Analysis
Before you start model building, it's important to look at the data.
With the Titanic data set, I started off by:-
- looking at the distribution (continuous data) or frequencies (discrete data) of each variable
- looking at the bivariate relationship between each variable and "survived".
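For discrete variables, both steps can be sketched with table() and prop.table(). The values are from the six rows shown earlier, so the proportions are illustrative only; the object name tab is an assumption.

```r
titanic <- data.frame(
  survived = c(0, 1, 1, 1, 0, 0),
  pclass   = c(3, 1, 3, 1, 3, 3),
  sex      = c("male", "female", "female", "female", "male", "male")
)

# Frequencies of a single discrete variable
table(titanic$pclass)

# Bivariate relationship: sex (rows) against survived (columns)
tab <- table(titanic$sex, titanic$survived)
tab

# Row proportions: the share of each sex that did or did not survive
prop.table(tab, margin = 1)
```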
Charts are often the most effective way of exploring the data. For example, with continuous data, produce a histogram overlaid with a density plot.
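A histogram with a density overlay takes two calls in base R graphics. This sketch uses randomly generated ages as a stand-in for the real age column; the title and axis label are assumptions.

```r
set.seed(1)

# Synthetic stand-in for the age column, including some missing values
age <- c(round(runif(150, min = 1, max = 70)), NA, NA)

# freq = FALSE puts the histogram on the density scale so the curve lines up
h <- hist(age, freq = FALSE, main = "Age", xlab = "Age (years)")

# density() does not accept NAs unless na.rm = TRUE
lines(density(age, na.rm = TRUE), lwd = 2)
```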
Logistic regression in R is fitted with the glm() function in the stats package. The stats package is part of the standard R installation, so there is nothing extra to install. Most other packages must first be installed on your computer with install.packages("package.name") and then loaded with library(package.name).
The code to generate a logistic regression model is :
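A minimal sketch of such a model is given here. The data are simulated so that the example runs on its own, and the choice of sex, pclass and fare as predictors and the name titanic.logit are my assumptions, not necessarily the variables behind the score quoted later.

```r
set.seed(123)
n <- 300

# Simulated stand-in for the training data
titanic <- data.frame(
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  pclass = factor(sample(1:3, n, replace = TRUE)),
  fare   = round(runif(n, min = 5, max = 100), 2)
)
p <- plogis(1 - 2 * (titanic$sex == "male") -
              0.5 * (as.integer(titanic$pclass) - 1) +
              0.01 * titanic$fare)
titanic$survived <- rbinom(n, size = 1, prob = p)

# family = binomial makes glm() fit a logistic regression
titanic.logit <- glm(survived ~ sex + pclass + fare,
                     data = titanic, family = binomial)
summary(titanic.logit)
```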
- set R's working directory to the desktop, so that the file is saved there (RStudio has a menu option for this under "Session")
- then save the file : write.table(titanic.predict, file = "titanic_predict.csv", sep = ",")
- open the file in Excel
- convert the probabilities to categories (0 or 1) using Excel's IF function
- paste the categories as the first column of the test file
- upload to Kaggle and see what score you get.
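The Excel steps can also be done in R. The sketch below uses simulated training and test data so that it runs on its own. The 0.5 cutoff, the helper make_data, and the names fit, test, submission and out_path are all assumptions for illustration.

```r
set.seed(7)

# Hypothetical helper that simulates passengers; stands in for the Kaggle files
make_data <- function(n) {
  data.frame(
    sex    = factor(sample(c("female", "male"), n, replace = TRUE),
                    levels = c("female", "male")),
    pclass = factor(sample(1:3, n, replace = TRUE), levels = 1:3)
  )
}
train <- make_data(300)
train$survived <- rbinom(300, size = 1, prob = plogis(
  1.5 - 2.5 * (train$sex == "male") - 0.5 * (as.integer(train$pclass) - 1)))
test <- make_data(50)

fit <- glm(survived ~ sex + pclass, data = train, family = binomial)

# type = "response" returns predicted probabilities rather than log-odds
titanic.predict <- predict(fit, newdata = test, type = "response")

# Convert probabilities to categories, assuming a cutoff of 0.5
survived <- ifelse(titanic.predict > 0.5, 1, 0)

# Put the categories in the first column of the test file and save it
submission <- cbind(survived = survived, test)
out_path <- file.path(tempdir(), "titanic_predict.csv")
write.csv(submission, file = out_path, row.names = FALSE)
```

With the real data, replace tempdir() with the folder you want the file saved in.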
This simple model scores 0.76555.
From here, you can try different models. Part of the expertise in using logistic regression is knowing what variables to include or exclude, what variables to transform, and what new variables to create.
If anything is not clear, post your question in the comments section and I'll endeavor to answer.