Saturday, March 30, 2013

Kaggle Titanic Competition

There's a great website called Kaggle, which I'm sure you've heard of, that runs predictive analytics competitions.

Some time ago, Kaggle started offering "Getting Started" competitions, which "provide an ideal starting place for people who may not have a lot of experience in data science and machine learning". This semester, the subject I am taking at university (Advanced Topics in Regression) fits in nicely with learning about machine learning. What I like is that the tutorials provided by Kaggle show that high-powered software is not necessary to start learning about the concepts involved. Even Excel can be used to understand what is involved.

So I've submitted my first entry in the Titanic: Machine Learning from Disaster competition - a default naive entry. The majority of people in the training dataset did not survive, and so I've predicted that no-one in the test dataset survived.
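For anyone who wants to reproduce this kind of naive entry outside Excel, here is a minimal Python sketch of the idea: find the most common outcome in the training data and predict it for every test passenger. The function names and the toy data are my own for illustration, and the `PassengerId` and `Survived` column names are assumed to match the competition files.

```python
from collections import Counter


def majority_class(labels):
    """Return the most common label among the training outcomes."""
    return Counter(labels).most_common(1)[0][0]


def baseline_submission(train_labels, test_ids):
    """Predict the training majority class for every test passenger."""
    guess = majority_class(train_labels)
    return [{"PassengerId": pid, "Survived": guess} for pid in test_ids]


# Toy stand-ins for the real training outcomes and test IDs (hypothetical values).
toy_train_labels = [0, 0, 1, 0, 1, 0]  # 0 = did not survive, 1 = survived
toy_test_ids = [892, 893, 894]

submission = baseline_submission(toy_train_labels, toy_test_ids)
for row in submission:
    print(row["PassengerId"], row["Survived"])
```

With the real data, the labels would come from the training file and the rows would be written out as a two-column CSV for upload.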

That entry gives me a score of 0.62679, which places me equal 2175th out of 2295. The top ten scores range from 0.85167 to 0.96172. It will be interesting to see whether any of the current top entries are disqualified, as information about the outcome for the test set is readily available on the internet.

To place this default submission in context, the Kaggle benchmark entries are:

- Gender, price and class - position 430 (0.77990)
- My First Random Forest - position 593 (0.77512)
- Gender - position 1177 (0.76555)
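To give a feel for what the simplest of these benchmarks involves, here is a rough sketch of a gender-only rule and how it could be scored. I am assuming that the Gender benchmark predicts survival for women and not for men, and that the score is simply the fraction of correct predictions; the function names and toy passengers are made up for illustration.

```python
def gender_rule(sex):
    """Assumed gender-only rule: predict survival (1) for women, 0 for men."""
    return 1 if sex == "female" else 0


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual outcomes."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Toy passengers as (sex, survived) pairs (hypothetical values, not real data).
toy_passengers = [
    ("female", 1),
    ("male", 0),
    ("male", 1),
    ("female", 1),
    ("male", 0),
]

predictions = [gender_rule(sex) for sex, _ in toy_passengers]
outcomes = [survived for _, survived in toy_passengers]
print(accuracy(predictions, outcomes))
```

The same `accuracy` function can be used to compare the all-zero default entry against the gender rule on the training data before submitting anything.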

The following histograms show the distribution of scores:

1 comment:

  1. Kaggle is awesome. I've tried Random Forest, Decision Tree, Neural Network, and lots of other machine learning approaches to this problem. My current score is 0.799 (after 20ish submissions). When I was unable to raise my score using machine learning alone, I finally reached that score by adding some good old-fashioned manual SQL analysis to a test set predicted via Random Forest. It's a great tutorial to get your feet wet, that's for sure. I'd like to point out that your statement "Even Excel can be used" is really downplaying the power of Excel as a data mining tool. Excel is a GREAT tool for data mining. It has many tools for this kind of analysis that take little effort to use. Couple that with the data visualization capabilities of Excel and you have a great data mining tool IMO.