Friday, April 30, 2010

Kaggle - a home for data mining challenges

I got a promotional email today from a new project called Kaggle. Somewhat related to Innocentive, this project aims to connect challenging problems with people who have the right set of skills to solve them. Kaggle is specifically aiming to host prediction challenges and should appeal more to the data mining community. For example, the site is currently hosting a challenge about HIV progression where problem solvers are given a training dataset and asked to predict improvement in a patient's viral load.

I sent a few questions to Anthony Goldbloom (who works for Kaggle) to get a better idea of what the site is about:

Could you just tell me a bit about the company?
The project was inspired by an internship I did as a journalist in London in 2008, when I wrote about the use of data by organizations. I am an econometrician by training and I was excited to see the principles we use to forecast economic growth, inflation, etc., being applied by organizations. I returned to Australia and resolved to get involved in the broader analytics community. That's how Kaggle was born.

It looks like a young startup, is this right?
The project is only two weeks old and we've been thrilled with the response - we've attracted over 6,000 unique visitors. 

We launched the Eurovision contest to get things going. In the last few days we released the HIV Progression Prediction competition. This was my introduction to bioinformatics, which seems like a fascinating area - we're hoping to attract more such competitions. Perhaps your readers have ideas or data.

Does the name mean anything?
The name doesn't mean anything. I got tired of coming up with great names and finding they were taken (and that the owner would only sell for $xx,xxx). As a young project, our funds could be better spent elsewhere, so I built a program that iterated over different combinations of letters and printed a list of available and phonetic domain names. (I put this program on the web for others in a similar situation.)
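
For readers curious what such a name generator might look like, here is a minimal Python sketch. It is not Anthony's program, which isn't shown here. It assumes that "phonetic" can be approximated by alternating consonants and vowels, and it takes the availability check as a caller-supplied function (`is_available` is a hypothetical stand-in for a real WHOIS or registrar lookup). The function names and the `.com` default are likewise my own choices for illustration.

```python
from itertools import islice, product

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def pronounceable_names(length):
    """Yield every string of `length` letters that alternates
    consonant, vowel, consonant, ... (an assumed proxy for "phonetic")."""
    pools = [CONSONANTS if i % 2 == 0 else VOWELS for i in range(length)]
    for letters in product(*pools):
        yield "".join(letters)


def available_domains(names, is_available, tld=".com"):
    """Yield name + tld for each name the caller-supplied check accepts.

    `is_available` is hypothetical: in practice it would wrap a WHOIS
    or registrar query, which this sketch deliberately leaves out.
    """
    for name in names:
        domain = name + tld
        if is_available(domain):
            yield domain


if __name__ == "__main__":
    # Made-up set of "taken" domains standing in for a real lookup.
    taken = {"baba.com", "babe.com"}
    candidates = pronounceable_names(4)
    free = available_domains(candidates, lambda d: d not in taken)
    # Print only the first few of the 11,025 four-letter candidates.
    for domain in islice(free, 5):
        print(domain)
```

Keeping the availability check as a plain function argument means the letter-iteration part can be tried offline, and a real lookup can be swapped in later without touching the generator.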

How do you hope to be different from what Innocentive is doing?
The project is solely focused on data competitions. This enables us to offer services - e.g. to help our clients frame their problems, anonymize their data, etc.

The platform is also easily extensible, so we can modify it to suit the specific needs of different data competitions. 

We will host a rating system/league table, so that statisticians can use strong performances to market themselves. The rating system also allows us to host forecasting competitions, since the competition host will know who has a track record of forecasting well (and therefore who to pay attention to).

In the medium term, we plan to also offer a tender system, so that consultants can bid for work from organizations and researchers all over the world. From the organization's perspective, the rating system means they know what they're paying for. From the consultant's perspective, they don't have to waste time touting for work and they get access to interesting clients and datasets.