## Thursday, June 6, 2013

### Random Forest 2

Submitted my second random forest entry, and even though it is still well under my best entry to date, it is a good improvement on my first RF entry.

First RF entry : 0.73206
Second RF entry: 0.77033

Here is the model formula:

`model1 <- randomForest(survived ~ pclass + male + age_missing + combined_age + sibsp + parch + family + fare + log_fare + cabin_A + cabin_B + Cabin_C + cabin_D + Cabin_E + Cabin_F + Cabin_G + cabin_T + Cabin_X + Cabin_Y + cabin_missing + embarked_C + embarked_Q + age_class.interaction + Title_Other + Title_Master + Title_Miss + Title_Mr + fare_per_person + sex_class + std_fare + std_combined_age + std_fare_per_person + age_squared + fare_squared + fare_per_person_squared + age_class_squared, data = rf.train, importance = TRUE, ntree = 1000, do.trace = 100)`

Here's what I learnt about RF from preparing this entry:

• If you are using RF for classification, make sure your target variable is coded as a factor, not as an integer or numeric data type - otherwise `randomForest` will fit a regression model instead of a classification model. Here is the R code to change the data type of one variable:

`rf.train$survived <- as.factor(rf.train$survived)`
• Check that all integer or numeric variables are recognised as such, and not as factors. If a numeric variable has been read by R as a factor, you can use similar code to the above to correct it.
• Any factors need to have identical levels in the train and test sets. For example, you can't have 3 levels of a factor in your training data set and only 2 of those 3 levels in the test data set.
• It goes without saying that variable names in the train and test data sets need to be identical.
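One way to enforce matching factor levels (a sketch only; `embarked` here is just an example column name, standing in for whichever factor is affected):

```r
# Hypothetical sketch: give a factor the same set of levels in train and test.
# 'embarked' is an assumed example column name in rf.train and rf.test.
all_levels <- union(levels(rf.train$embarked), levels(rf.test$embarked))
rf.train$embarked <- factor(rf.train$embarked, levels = all_levels)
rf.test$embarked  <- factor(rf.test$embarked,  levels = all_levels)
```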

I need to work out how to use the abbreviated formula notation to simplify the formula when all variables are being included:

`model1 <- randomForest(survived ~ . , data = rf.train, importance = TRUE,ntree=1000, do.trace = 100)`
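A related note to self: standard R formula syntax also lets you drop terms from the `.` shorthand with `-`. For example (untested here; `fare` and `log_fare` are just columns from the list above):

```r
# Hypothetical sketch: use all columns of rf.train as predictors
# except fare and log_fare.
model1 <- randomForest(survived ~ . - fare - log_fare, data = rf.train,
                       importance = TRUE, ntree = 1000, do.trace = 100)
```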

Here's the other piece of code used to generate predictions:

`output <- predict(model1, rf.test, type="response")`

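For my own reference, turning `output` into a submission file might look like this (a sketch only; I'm assuming `rf.test` carries a passenger id column, here called `passengerid`, and the file name is arbitrary):

```r
# Hypothetical sketch: build a Kaggle submission from the RF predictions.
# 'passengerid' is an assumed column name in rf.test.
submission <- data.frame(PassengerId = rf.test$passengerid,
                         Survived    = output)
write.csv(submission, "rf_entry2.csv", row.names = FALSE)
```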

#### 4 comments:

1. One thing I came across in the Amazon competition that was helpful was to glue the train and test data sets together. Only then assign the factor variables as you stated, and then split again based on the presence of the predictor. Until I came across that, I was unable to submit entries, as the test dataset contained factor levels that the model did not know about!
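The glue-then-split approach described above could be sketched like this (assuming the test set lacks the `survived` column, which then marks where to split; `embarked` is again just an example factor):

```r
# Hypothetical sketch: combine train and test so factor levels are shared,
# then split again on the missing target column.
test$survived <- NA
combined <- rbind(train, test)
combined$embarked <- as.factor(combined$embarked)  # levels now consistent
train <- combined[!is.na(combined$survived), ]
test  <- combined[ is.na(combined$survived), ]
```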

Just to give another quick plug for the caret package: if you use it to run your random forest, it will tune the value of mtry for you. You would run it as:

`install.packages("caret")`
`install.packages("randomForest")`
`library(caret)`
`library(randomForest)`
`rfmodel <- train(survived ~ ., data = train, method = "rf")`

2. Thanks for the comments.

I'm having a look at caret.

If I could get a code snippet to use caret for SVM, to start me off, that would be appreciated. Also, any guidance on building models with SVM would be appreciated - eg, any pre-processing, etc.

3. No worries at all - drop me an email if you would like more detailed notes. Here is some code to get you started with SVMs using caret:

install.packages("kernlab")
install.packages("caret")
library(kernlab)
library(caret)

svmmodel<-train(train$survived~., data=train,
+ method="svmRadial",
##alternativley you could use svmPoly or svmLinear or ##svmRadialCost
+ metric="Accuracy",
##alt "Kappa" or "ROC" etc
+ preProc=c("knnImpute",
## this will impute the missing values you can leave it off ##if you have done it some other way
+ "center", "scale"))
## I believe svm's are quicker if you do this
titanicpred<-predict(svmmodel, newdata=test)

4. Sorry - the formatting went awry

`install.packages("kernlab")`
`install.packages("caret")`
`library(kernlab)`
`library(caret)`

`svmmodel <- train(survived ~ ., data = train, method = "svmRadial", metric = "Accuracy", preProc = c("knnImpute", "center", "scale"))`