## Sunday, April 21, 2013

### Titanic Prediction Competition - Where Am I Up To?

I wanted to stop for a moment, reflect on what approaches I've been taking with the Titanic Prediction Competition, and plan how I might tackle this competition in future.

This competition fits in nicely with my studies - I'm enrolled in the Master of Science (Applied Statistics) programme at Swinburne University (Melbourne, Australia). I am at the two-thirds mark, and the unit I'm currently doing (Advanced Topics in Regression) is directly relevant to this competition.

Up till now, the course has focussed primarily on what I'd call standard statistics - linear regression with normal distributions. Advanced Topics in Regression takes us beyond this and introduces us to modelling techniques that can be used when the assumptions of normality and linearity don't apply. These techniques include:
• transforming predictor and / or response variables
• creating new predictors (eg: polynomial terms, indicator variables, interactions)
• piecewise regression
• non-linear regression
• weighted least squares regression
• loglinear analysis
• generalised linear models
• binary / ordinal / multinomial logistic regression
• multilevel regression

But knowing new modelling techniques is not the only advantage of being two-thirds of the way through an Applied Statistics course. I'm also developing what might be called an intuitive understanding of what data analysis is all about:

• becoming better at defining and re-defining your research question
• knowing which analysis technique is the most appropriate for your data and research question (what are the data assumptions of the method, does the data at hand meet those assumptions, and what are the implications when the assumptions are violated)
• understanding that residuals are important
• knowing that selection of predictors is important (and this includes extracting new variables from the data you start with)
• appreciating the importance of exploratory data analysis

So the knowledge I gained from my course means that competitions like this one provide valuable experience:

• relatively small size of data set - this means the challenges are analytical rather than the programming / computer science challenges of handling big data - if my programming skills are not up to it, I can still do something manually (in Excel)
• real world issues of missing data - by contrast, the data sets you deal with at university, even at grad school, tend to be neatly packaged and designed to illustrate the concept / technique being studied
• benchmarking - Kaggle tells you where you stand in comparison with everyone else - based on my current standing (position 1098, with a score of 0.77512) there are a lot of people who have extracted a lot more information out of the data set than I have

My aim is to initially use the techniques I'm learning in Advanced Topics in Regression:

• Cross tab / chi square
• Log linear analysis
• binary and multinomial logistic regression / robust logistic regression
• generalised linear regression

Next, I'm going to try out the techniques covered in my next unit (Statistical Marketing Tools), which covers data mining tools.

Finally, writing posts like this forces me to give a bit of thought to what I am doing.

#### Exploratory Data Analysis

Exploratory data analysis is a two-step process - first checking the accuracy of your data set, and then understanding your data set.

Here, we don't need to worry about accuracy. Kaggle have provided a data set that is internally consistent. That's unlike many real world data sets. There you're faced with transcription and coding errors, out of range values (negative or zero ages, incomes with a decimal point in the wrong place) and such like.

So with the Titanic data set, our aim is to understand the data.

For me, the key parts of exploratory data analysis are to look at each variable as follows:

1. Produce a histogram or bar plot to look at the distribution of values
2. Crosstabulation / chi square analysis with the dependent variable (or alternatively, a logistic regression with just the variable in question as the predictor)
3. Summary report (like the Frequencies or Descriptives reports in SPSS)
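
As a rough illustration of steps 1 and 2, here is a minimal Python sketch that uses only the standard library. The records are made-up toy values, not the competition data, and the helper names (`crosstab`, `chi_square`) are my own. It counts the values of one variable, crosstabulates it against survival, and computes the Pearson chi square statistic.

```python
import math
from collections import Counter

# Made-up toy records of (sex, survived); NOT the real Titanic data.
records = (
    [("female", 1)] * 30 + [("female", 0)] * 10
    + [("male", 1)] * 12 + [("male", 0)] * 48
)

# Step 1: distribution of a single variable, shown as a crude text bar plot.
sex_counts = Counter(sex for sex, _ in records)
for value, count in sorted(sex_counts.items()):
    print(f"{value:<8}{'#' * (count // 2)} {count}")


def crosstab(pairs):
    """Count how often each (row value, column value) pair occurs."""
    return Counter(pairs)


def chi_square(table):
    """Return the Pearson chi square statistic and degrees of freedom."""
    rows = sorted({r for r, _ in table})
    cols = sorted({c for _, c in table})
    total = sum(table.values())
    row_tot = {r: sum(table[(r, c)] for c in cols) for r in rows}
    col_tot = {c: sum(table[(r, c)] for r in rows) for c in cols}
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = row_tot[r] * col_tot[c] / total
            stat += (table[(r, c)] - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(cols) - 1)
    return stat, dof


# Step 2: crosstab of the variable against the dependent variable.
table = crosstab(records)
stat, dof = chi_square(table)

# With 1 degree of freedom (a 2x2 table) the p-value has a closed form.
p_value = math.erfc(math.sqrt(stat / 2)) if dof == 1 else None
print(f"chi square = {stat:.3f}, dof = {dof}, p = {p_value}")
```

The closed-form p-value only applies to a 2x2 table; for larger tables a statistics package would be the sensible route.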

#### Missing Values

One piece of information that comes from exploratory data analysis is whether there are any missing values for a variable.

With the Titanic data set, there are several variables with missing values, including:

• age
• cabin

##### Age

There are 177 missing values for age.

The discussion forum on the competition website has devoted a fair amount of comment to handling the missing age values in particular. Age is a significant predictor of survival, and therefore it makes sense to impute as accurate an age as possible where it is missing.

The methods of imputing the missing values for age include:

1. If age is missing, predict non-survival
2. Use the average age overall
3. Use the average age for each title group (title = Mr, Mrs, Master, Miss, etc)

It's also possible to substitute some other measure of central tendency, usually the median.

I think it is also worth seeing if a regression model can be built to predict age from all the other variables (excluding survival).

Finally, one could manually impute age by looking at the position of the individual in a family. If a person traveled with other family members, then it may be possible to "guess" age based on who those family members are (eg, if a person traveled with a spouse or sibling, then that would potentially give a reasonable indication of the person's age).
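
To make the title-group idea concrete, here is a minimal Python sketch, again standard library only. The passengers are hypothetical (written in the "Surname, Title. Given names" style of the competition's name field), and the function names and the choice of falling back to the overall median are my own assumptions, not something prescribed by the forum.

```python
import re
from statistics import mean, median

# Hypothetical passengers; names and ages are invented for illustration.
passengers = [
    {"name": "Smith, Mr. John", "age": 40.0},
    {"name": "Smith, Mrs. Jane", "age": 36.0},
    {"name": "Jones, Mr. Alan", "age": 30.0},
    {"name": "Jones, Master. Tom", "age": 4.0},
    {"name": "Brown, Mr. Peter", "age": None},
    {"name": "Brown, Master. Sam", "age": None},
    {"name": "Green, Dr. Ann", "age": None},
]


def extract_title(name):
    """Return the text between the comma and the first full stop, eg 'Mr'."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Unknown"


def impute_age(rows):
    """Fill missing ages with the mean age of the passenger's title group.

    Assumption: titles with no known ages fall back to the overall median.
    """
    known = [r["age"] for r in rows if r["age"] is not None]
    overall_median = median(known)

    ages_by_title = {}
    for r in rows:
        if r["age"] is not None:
            ages_by_title.setdefault(extract_title(r["name"]), []).append(r["age"])
    title_mean = {title: mean(ages) for title, ages in ages_by_title.items()}

    filled = []
    for r in rows:
        age = r["age"]
        if age is None:
            age = title_mean.get(extract_title(r["name"]), overall_median)
        filled.append({**r, "age": age})
    return filled


imputed = impute_age(passengers)
for row in imputed:
    print(row["name"], row["age"])
```

Swapping `mean` for `median` inside `impute_age` gives the median-per-title variant.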

##### Cabin

One contributor to the discussion forum suggested the following method for replacing missing cabin values:

• replace missing value with passenger class

There are 687 missing values for cabin - so the values that are present would arguably not provide a lot of information.
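
One way to read that suggestion is sketched below in Python. Keeping only the first letter of the cabin value when it is present is my own assumption (the forum suggestion only covers the missing case), and `cabin_feature` and the sample values are hypothetical.

```python
def cabin_feature(cabin, pclass):
    """Return a categorical cabin feature for one passenger.

    Assumption: when the cabin is known we keep only its first letter;
    when it is missing (None or empty) we substitute the passenger class.
    """
    if cabin:
        return cabin[0]
    return f"class{pclass}"


# Hypothetical (cabin, pclass) values for illustration.
examples = [("C85", 1), (None, 3), ("", 2)]
features = [cabin_feature(cabin, pclass) for cabin, pclass in examples]
print(features)
```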

#### Feature Selection and Construction

Selecting the right variables, and extracting all the predictive features from the data set, is important.

Ways to improve the model could include:

• Use backward elimination to build model.
• age and pclass are not linear predictors of survival - how best to construct an interaction term
• extract title and use as a variable
• investigate if there are other interactions present
• construct fare per person variable, and look how this compares with total fare variable
• which procedures require dummy / indicator variables for analysis
• what information does the cabin variable provide (particularly with so many missing values)
• do I need to normalize continuous variables
• can these methods be used in an ensemble manner
• with decision cuts, how to model the effect of different cuts on the results
• would any variables benefit from squaring or cubing etc
• create a family variable
• is it worth using name to create ethnicity variable
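
A few of the constructed variables in this list can be sketched in Python as follows. This is only an illustration under my own assumptions: the family variable is taken to be siblings/spouses plus parents/children plus the passenger, fare per person is the total fare divided by that family size, and the interaction is a simple product of age and pclass. The field names and the sample passenger are hypothetical.

```python
def build_features(p):
    """Derive candidate predictors from one passenger record (a dict)."""
    # Assumption: family size = siblings/spouses + parents/children + self.
    family_size = p["sibsp"] + p["parch"] + 1
    return {
        "family_size": family_size,
        # Assumption: the fare recorded covers the whole family group.
        "fare_per_person": p["fare"] / family_size,
        # One simple form of an age x pclass interaction term.
        "age_x_pclass": p["age"] * p["pclass"],
        # Polynomial term, to test whether squaring helps.
        "age_squared": p["age"] ** 2,
        # Dummy / indicator variable for a categorical predictor.
        "is_female": 1 if p["sex"] == "female" else 0,
    }


# Hypothetical passenger, invented for illustration.
passenger = {
    "age": 30.0, "pclass": 3, "sex": "female",
    "sibsp": 1, "parch": 2, "fare": 24.0,
}
features = build_features(passenger)
print(features)
```

Whether any of these actually improves the model would still need to be checked, for example by comparing total fare against fare per person as predictors.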