Emma Gertlowitz: Random Forest 2

Thursday, June 6, 2013

Random Forest 2

I submitted my second random forest (RF) entry. Although it scores well below my best entry to date, it is a good improvement on my first RF entry.

First RF entry:  0.73206
Second RF entry: 0.77033

Here is the model call:

library(randomForest)

model1 <- randomForest(survived ~ pclass + male + age_missing + combined_age +
    sibsp + parch + family + fare + log_fare +
    cabin_A + cabin_B + Cabin_C + cabin_D + Cabin_E + Cabin_F + Cabin_G +
    cabin_T + Cabin_X + Cabin_Y + cabin_missing +
    embarked_C + embarked_Q + age_class.interaction +
    Title_Other + Title_Master + Title_Miss + Title_Mr +
    fare_per_person + sex_class +
    std_fare + std_combined_age + std_fare_per_person +
    age_squared + fare_squared + fare_per_person_squared + age_class_squared,
    data = rf.train, importance = TRUE, ntree = 1000, do.trace = 100)

Here's what I learnt about RF from preparing this entry:

If you are using RF for classification, make sure your target variable is coded as a factor, and not as an integer or numeric data type. Otherwise randomForest will treat the problem as regression rather than classification. Here is the R code to change the data type of one variable:

rf.train$survived <- as.factor(rf.train$survived)

Check that all integer or numeric variables are recognised as such, and not as factors. If a numeric variable has been read by R as a factor, you can correct it in a similar way, but convert through as.character() first, i.e. as.numeric(as.character(x)), because as.numeric() applied directly to a factor returns the internal level codes rather than the original values.
Any factors need to have identical levels in the train and test sets. For example, you can't have 3 levels of a factor in your training data set and only 2 of those 3 levels in the test data set.
It goes without saying that variable names in the train and test data sets need to be identical.
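The data-type checks above can be sketched in base R roughly as follows. The toy data frames and their columns (toy.train, toy.test, fare, embarked) are made up for illustration and are not the actual competition files.

```r
# Toy data, made up for illustration (not the real competition files)
toy.train <- data.frame(
  survived = c(0, 1, 1, 0),
  fare     = factor(c("7.25", "71.28", "8.05", "53.1")),  # numeric read in as a factor
  embarked = factor(c("S", "C", "Q", "S"))
)
toy.test <- data.frame(
  fare     = c(7.9, 26.0),
  embarked = factor(c("S", "C"))  # only 2 of the 3 training levels
)

# 1. Target as a factor, so the model is a classifier
toy.train$survived <- as.factor(toy.train$survived)

# 2. Factor that should be numeric: convert via as.character()
toy.train$fare <- as.numeric(as.character(toy.train$fare))

# 3. Give the test factor the same levels as the training factor
toy.test$embarked <- factor(toy.test$embarked,
                            levels = levels(toy.train$embarked))

# Inspect the resulting column types
str(toy.train)
str(toy.test)
```

Running str() on both data frames after each change is a quick way to confirm that every column has the type you expect.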

I need to work out how to use the abbreviated formula version to simplify the formula when all variables are being included:

model1 <- randomForest(survived ~ ., data = rf.train, importance = TRUE, ntree = 1000, do.trace = 100)
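One way the "." shorthand might be made to work is to pass a data frame that contains only the target and the wanted predictors, since "." expands to every column other than the response. Here is a minimal sketch in base R. The toy data and the dropped columns (passenger_id, name) are assumptions for illustration only.

```r
# Toy data, made up for illustration
toy <- data.frame(
  passenger_id = 1:4,
  name         = c("A", "B", "C", "D"),
  survived     = factor(c(0, 1, 1, 0)),
  pclass       = c(3, 1, 3, 1),
  fare         = c(7.25, 71.28, 8.05, 53.1)
)

# Drop the columns that should not be used as predictors
drop.cols <- c("passenger_id", "name")
toy.model <- toy[, !(names(toy) %in% drop.cols)]

# "." now expands to every remaining column except the response
expanded   <- terms(survived ~ ., data = toy.model)
predictors <- attr(expanded, "term.labels")
print(predictors)
```

A data frame prepared like toy.model could then be passed as the data argument with the survived ~ . formula.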

Here's the other piece of code used to generate predictions:

output <- predict(model1, rf.test, type="response")

4 comments:

Stephen Oates, June 16, 2013 at 9:38 PM
One thing I came across in the Amazon competition that was helpful was to glue the train and test data sets together, only then assign the factor variables as you stated, and then split them again based on the presence of the target variable. Until I came across that, I was unable to submit entries, as the test data set contained factor levels that the model did not know about!

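Just to give another quick plug for the caret package: if you use it to run your random forest, it will tune the value of mtry for you. You would run it as:
install.packages("caret")          # one-time install
install.packages("randomForest")
library(caret)
library(randomForest)
# Use the bare column name on the left-hand side: with train$survived ~ .,
# the "." would still include survived itself among the predictors.
# survived should be a factor if you want classification.
rfmodel <- train(survived ~ ., data = train, method = "rf")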
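The glue-then-split approach described in this comment could look roughly like the sketch below, in base R. The toy data frames are made up for illustration, and using NA as a placeholder target for the test rows is an assumption about how the split is identified.

```r
# Toy data, made up for illustration
toy.train <- data.frame(survived = c(0, 1, 1),
                        embarked = c("S", "C", "S"),
                        stringsAsFactors = FALSE)
toy.test  <- data.frame(embarked = c("Q", "S"),  # "Q" never appears in training
                        stringsAsFactors = FALSE)

# Give the test set a placeholder target so the columns match, then stack
toy.test$survived <- NA
toy.test <- toy.test[, names(toy.train)]
combined <- rbind(toy.train, toy.test)

# Assign factors once, on the combined data, so the levels are shared
combined$embarked <- as.factor(combined$embarked)

# Split again: rows with a known target are the training rows
is.train  <- !is.na(combined$survived)
new.train <- combined[is.train, ]
new.test  <- combined[!is.train, ]
new.test$survived <- NULL

# Convert the target to a factor only after splitting
new.train$survived <- as.factor(new.train$survived)

print(levels(new.train$embarked))
print(levels(new.test$embarked))
```

Both data frames now carry the same factor levels, including any level that only occurs in the test rows. The model will not have seen such a level during training, so predictions for those rows are worth a second look.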