The-Data-Guy: Can you Kaggle?

Kaggle?

Do you even Kaggle, Bro?

Kaggle is a platform for predictive modelling and analytic competitions on which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modelling task and it is impossible to know at the outset which technique or analyst will be most effective. - Kaggle Site Definition

Basically, Kaggle is a site that periodically hosts competitions for machine learning and data mining practitioners.

Generally the format of the competition is this:

Here is a training dataset

Here is a test data set

Here is a sample submission file

The training and test data sets contain the "key" features that the people designing the competition recognize as important, with one row per observation. The training data set also includes a particular "training" or "outcome" variable. The biggest difference between the two is that the test data set does not have that variable.

This is what you need to accurately predict.

The sample submission file is generally a file with two columns: an ID variable and the response variable.

The ID in the submission is a unique identifier from the test data set, and the response is what you and your algorithm predict.
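As a rough sketch of that workflow, the Python below builds a submission from a few made-up training and test rows. The column names (`ID`, `feature`, `target`) and the predict-the-mean "model" are assumptions for illustration only; a real competition defines its own columns, and you would swap in an actual algorithm.

```python
import csv
import io
from statistics import mean

# Hypothetical toy data: the training rows include the outcome ("target"),
# the test rows do not.
train_rows = [
    {"ID": "1", "feature": "0.2", "target": "0"},
    {"ID": "2", "feature": "0.9", "target": "1"},
    {"ID": "3", "feature": "0.7", "target": "1"},
    {"ID": "4", "feature": "0.1", "target": "0"},
]
test_rows = [
    {"ID": "5", "feature": "0.8"},
    {"ID": "6", "feature": "0.3"},
]


def fit_baseline(rows):
    """Return the mean outcome of the training rows (a deliberately naive 'model')."""
    return mean(float(r["target"]) for r in rows)


def make_submission(test, prediction):
    """Build two-column submission text: each test ID plus the predicted response."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ID", "target"])  # header mirrors the sample submission layout
    for row in test:
        writer.writerow([row["ID"], prediction])
    return buffer.getvalue()


baseline = fit_baseline(train_rows)
submission_csv = make_submission(test_rows, baseline)
print(submission_csv)
```

The baseline ignores the feature column entirely, so it is only useful as a sanity check that the file has the right shape before you spend time on a real model.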

There is an evaluation process behind the scenes. Once you submit your predictions, they are compared to a set of "known" outcomes for the test data set. These held-back values are the ones the group hosting the competition generally considers to be accurate.

An evaluation score is calculated from your predictions versus the accepted values. A metric such as the F-score, RMSE, or area under the ROC curve is used as the score.
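To make two of those metrics concrete, here is a small pure-Python sketch of RMSE and area under the ROC curve. The function names and sample numbers are made up for illustration, and the pairwise AUC calculation is a simple textbook formulation, not necessarily how any particular competition scores submissions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between known values and predictions."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def roc_auc(labels, scores):
    """Area under the ROC curve: the share of (positive, negative) pairs
    in which the positive row gets the higher score; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up regression example: every prediction is off by exactly 2.
print(rmse([3.0, 5.0, 2.0, 7.0], [1.0, 7.0, 4.0, 5.0]))  # 2.0

# Made-up classification example: 3 of the 4 positive/negative pairs are ranked correctly.
print(roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.3, 0.1]))  # 0.75
```

Lower is better for RMSE, while higher is better for AUC, so it is worth checking which direction a competition's leaderboard sorts in.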

These competitions span a wide variety of industries, and some of them allow you to win money.

Since there is money on the line, there are some groups that take these things incredibly seriously.

They will dedicate resources and time to winning a competition. As such, the leaderboard may or may not be an accurate representation of how well someone knows an algorithm or methodology; rather, it may represent the amount of time and resources dedicated to winning the competition.

For example, in one recent competition, Springleaf, the top score on the leaderboard is .80427 and second place is .80394, a difference of .00033.

The difference between my score and the leader's was .04759!

These competitions can be fun and exciting, and they provide opportunities to try new tools, methods, and algorithms. However, unless you dedicate a decent amount of time and energy to them, you may not come in first place.

2016-01-29
