First, I produced charts showing the relationship between age and each of the other variables. I have not tested for statistically significant relationships; these are just visual impressions.
There appears to be a relationship between cabins x, y and z and age. As these are dummy cabins that replace missing cabin values, this suggests a possible relationship between pclass and age.
The actual cabins show some relationship with age. For example, cabin G is associated with younger passengers than some other cabins.
No relationship between age and port of embarkation.
No relationship between age and number in family travelling with passenger (large number of people travelled alone).
No relationship between age and fare paid (total fare)
... nor fare per person.
No relationship between age and gender.
No relationship between age and number of parents / children the passenger was travelling with.
Some relationship between age and number of siblings / spouse the passenger was travelling with. For example, passengers travelling with 5 sibsp were no older than 20.
Small relationship between age and pclass. Third class passengers tended to be younger than first class passengers.
A clear relationship between age and title, which is why another contestant suggested using the mean age of each title to replace missing age values.
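The mean-age-per-title idea mentioned above could look roughly like the sketch below. It assumes pandas and uses a small made-up table; the column names `title`, `age` and `age_filled` are my own choices for illustration, not the competition's.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the Titanic files.
df = pd.DataFrame({
    "title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master", "Master"],
    "age":   [30.0, 40.0, None, 20.0, None, 4.0, 6.0],
})

# Mean age per title (missing ages are skipped), broadcast back to every row.
title_mean = df.groupby("title")["age"].transform("mean")

# Fill only the missing ages; observed ages are left untouched.
df["age_filled"] = df["age"].fillna(title_mean)
```

Here the missing "Mr" age is filled with 35 and the missing "Miss" age with 20, the means of the observed ages for those titles.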
Looking at these charts, I then produced a linear regression with the following variables (or indicator variables as appropriate):
- cabin
- Title
- sibsp
- pclass
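A regression of this kind can be sketched as follows. This is not the code I actually ran, just a minimal illustration assuming pandas and NumPy, with invented toy rows, cabin / Title / pclass turned into indicator columns and sibsp kept as a number (that split is an assumption made for the example).

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration; column names follow the post.
df = pd.DataFrame({
    "age":    [22.0, 38.0, 26.0, 35.0, 4.0, 54.0, 2.0, 27.0, 14.0, 58.0, 20.0, 39.0],
    "cabin":  ["x", "C", "z", "C", "z", "C", "z", "z", "C", "C", "y", "y"],
    "title":  ["Mr", "Mrs", "Miss", "Mrs", "Miss", "Mr",
               "Master", "Mrs", "Miss", "Mr", "Mr", "Mr"],
    "sibsp":  [1, 1, 0, 1, 1, 0, 3, 0, 1, 0, 0, 0],
    "pclass": [3, 1, 3, 1, 3, 1, 3, 3, 2, 1, 3, 2],
})

# Indicator columns for the categorical predictors. drop_first removes one
# level per variable so the indicators are not collinear with the intercept.
X = pd.get_dummies(
    df[["cabin", "title", "sibsp", "pclass"]],
    columns=["cabin", "title", "pclass"],
    drop_first=True,
    dtype=float,
)
X.insert(0, "intercept", 1.0)

# Ordinary least squares fit of age on the design matrix.
A = X.to_numpy(dtype=float)
y = df["age"].to_numpy()
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Fitted (predicted) age for every row.
df["predicted_age"] = A @ coef
```

With so few rows the fit itself means nothing; the point is only the shape of the design matrix and how the predicted ages are obtained.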
Here is a plot of age versus predicted age:
For each actual age, there is generally a wide range of predicted ages, which is of course not at all useful.
Finally, here is a chart of predicted values against standardised residuals. It is not the random scatter that we look for.
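For reference, standardised residuals can be computed along these lines. This is a minimal sketch assuming NumPy and made-up data with a single predictor. It uses the simple definition (residual divided by the residual standard error) with no leverage adjustment, which may differ slightly from what a statistics package reports.

```python
import numpy as np

# Toy data, invented for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.9, 5.2, 6.8, 9.1, 11.0])

# Design matrix with an intercept column, then an ordinary least squares fit.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef
residuals = y - predicted

# Residual standard error: sqrt(RSS / (n - p)), p = number of coefficients.
n, p = X.shape
sigma = np.sqrt(np.sum(residuals ** 2) / (n - p))

# Simple standardised residuals (no leverage adjustment).
standardised = residuals / sigma
```

Plotting `predicted` against `standardised` should give a patternless band around zero when the model is adequate; visible structure, as in my chart, suggests the model is missing something.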