First RF entry : 0.73206
Second RF entry: 0.77033
Here is the model call:
library(randomForest)
model1 <- randomForest(survived ~ pclass + male + age_missing + combined_age + sibsp + parch + family +
                         fare + log_fare + cabin_A + cabin_B + Cabin_C + cabin_D + Cabin_E + Cabin_F +
                         Cabin_G + cabin_T + Cabin_X + Cabin_Y + cabin_missing + embarked_C + embarked_Q +
                         age_class.interaction + Title_Other + Title_Master + Title_Miss + Title_Mr +
                         fare_per_person + sex_class + std_fare + std_combined_age + std_fare_per_person +
                         age_squared + fare_squared + fare_per_person_squared + age_class_squared,
                       data = rf.train, importance = TRUE, ntree = 1000, do.trace = 100)
Here's what I learnt about RF from preparing this entry:
- If you are using RF for classification, make sure your target variable is coded as a factor, not as an integer or numeric data type; otherwise randomForest will assume you want regression and grow regression trees instead of a classifier. For example, to change the data type of one variable: rf.train$survived <- as.factor(rf.train$survived)
- Check that all integer or numeric variables are recognised as such, and not as factors. If R has read a numeric variable as a factor, convert it through as.character() first, i.e. as.numeric(as.character(x)), because as.numeric() applied directly to a factor returns the internal level codes rather than the original values.
- Any factors need to have identical levels in the train and test sets. For example, you can't have a factor with 3 levels in the training data set and only 2 of those 3 levels in the test data set.
- It goes without saying that variable names in the train and test data sets need to be identical.
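The checks in the list above can be sketched in base R roughly as follows. This is a minimal illustration on made-up toy data frames (train and test are hypothetical stand-ins for rf.train and rf.test, and the columns are just examples), not the code used for the entry.

```r
# Toy stand-ins for rf.train / rf.test (hypothetical data, for illustration only)
train <- data.frame(survived = c(0L, 1L, 1L, 0L),
                    fare     = factor(c("7.25", "71.28", "8.05", "53.1")),
                    embarked = factor(c("S", "C", "Q", "S")))
test  <- data.frame(fare     = c(7.9, 26.0),
                    embarked = factor(c("S", "C")))

# 1. Code the target as a factor so randomForest does classification
train$survived <- as.factor(train$survived)

# 2. A number that was read as a factor: convert via as.character(),
#    because as.numeric() on a factor returns the internal level codes
train$fare <- as.numeric(as.character(train$fare))

# 3. Give the test factor the same levels as the training factor
test$embarked <- factor(test$embarked, levels = levels(train$embarked))

# 4. List any predictor that is in train but missing from test
missing_cols <- setdiff(setdiff(names(train), "survived"), names(test))
```

Running str(train) and str(test) afterwards is a quick way to confirm that each column ended up with the intended type.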
I need to work out how to use the abbreviated formula version to simplify the formula when all variables are being included:
model1 <- randomForest(survived ~ ., data = rf.train, importance = TRUE, ntree = 1000, do.trace = 100)
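As a rough illustration of what the dot does: in an R formula, the dot stands for every column of the data frame other than the response. The sketch below uses base R's terms() on a small made-up data frame (toy is hypothetical) to show the expansion; it does not fit a forest.

```r
# Hypothetical toy data frame, for illustration only
toy <- data.frame(survived = factor(c(0, 1, 1)),
                  pclass   = c(3, 1, 2),
                  male     = c(1, 0, 0),
                  fare     = c(7.25, 71.28, 13.0))

# Expand the dot against the columns of toy
all_terms <- attr(terms(survived ~ ., data = toy), "term.labels")

# A column can still be left out by subtracting it
some_terms <- attr(terms(survived ~ . - fare, data = toy), "term.labels")
```

One consequence is that any column left in rf.train, such as an identifier, would be picked up as a predictor, so columns that should not be used need to be dropped or subtracted first.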
Here's the other piece of code used to generate predictions:
output <- predict(model1, rf.test, type="response")