How InfiniteInsight Can Make You A Great Data Scientist

Discussions abound in blogs and forums about the performance of KXEN’s InfiniteInsight™ predictive analytics solution. I think it is time to state KXEN’s public position on this.

This earlier post explained why KXEN developed its own algorithms. As I mentioned there, a majority of KXEN’s customers bought its InfiniteInsight™ technology after benchmarking it against the first-generation tools, like SAS Enterprise Miner and SPSS Clementine, and finding InfiniteInsight™ to be superior.

At KXEN, we strive to automate the tedious, error-prone tasks inherent to predictive analytics — tasks like missing value and outlier processing, variable encoding, variable selection, post-hoc analysis, and so on. Certainly, this does not diminish the job of the data scientist; rather, InfiniteInsight allows the data scientist to focus on more important tasks, like gaining insight from the data.

We believe the role of a data scientist is to solve business problems by using data mining functions to turn raw data into strategic decisions. Data scientists who use KXEN’s InfiniteInsight as part of their toolbox will reach a solution 10 times faster and more consistently than when using first-generation tools.

Frequently, tools like InfiniteInsight are used to apply advanced analytics to processes which previously had no analytics at all. Here we generally see great leaps in predictive accuracy over the methods used before. On some occasions, however, we are called upon to optimize problems to which analytics have already been applied; sometimes an extra 1% improvement in predictive accuracy can be worth millions of dollars!  In these rare cases, this last 1% requires a lot of advanced knowledge and some very sophisticated techniques. Here, too, data scientists can use KXEN’s InfiniteInsight as part of their toolbox to build ensemble models, for example, to get every last bit of predictive goodness out of their data.

As I am a data scientist myself, I wouldn’t be worth my salt if I didn’t have data to back up my claims! So I created a demonstration video (I blogged about it earlier) using the data from the KDD Cup 98 challenge.

The task is basic enough: to predict donations to a charity for veterans. The data is available to everyone (even the test data is available without the need to register on a site), and there is a nice presentation of the leaderboard, so we can compare the KXEN results with those of the data scientists who submitted their results. The training data set contains ~95,000 lines and 481 columns (which we will henceforth refer to as “variables”). What is also nice is that the results are not based on an abstract mathematical measure, such as AUC, but on a simple, concrete measurement: dollars donated! Models are compared on the profit associated with the persons contacted by a model. A person is considered “contacted” by a model if the estimated donation amount is higher than 68 cents, and the profit for each contacted person is the donation minus the 68 cents that the contact costs.
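The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not the challenge's official scoring code: the function name `campaign_profit` and the toy data are hypothetical.

```python
CONTACT_COST = 0.68  # cost of one contact in dollars, as stated in the post


def campaign_profit(predicted, actual, cost=CONTACT_COST):
    """Return (total_profit, number_mailed, average_profit) for one model.

    predicted: estimated donation per person, in dollars
    actual:    donation actually made per person, in dollars (0 if none)
    """
    total = 0.0
    mailed = 0
    for estimate, donation in zip(predicted, actual):
        if estimate > cost:  # the model "contacts" this person
            mailed += 1
            total += donation - cost  # profit is the donation minus the contact cost
    average = total / mailed if mailed else 0.0
    return total, mailed, average


# Hypothetical toy data: four people, not taken from the challenge data set.
predicted = [0.10, 1.50, 0.90, 0.68]
actual = [5.00, 10.00, 0.00, 20.00]
total, mailed, average = campaign_profit(predicted, actual)
print(f"profit=${total:.2f}, mailed={mailed}, average=${average:.2f}")
```

Note that the fourth person is not contacted, because the estimate must be strictly higher than 68 cents, and that contacting a non-donor (the third person) costs the model 68 cents.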

The process with InfiniteInsight is simple: 1) you select the training data set, 2) you describe your 480 variables, 3) you select the continuous target and… you run! The entire training process takes ~5 minutes, and applying the model to a database (to be able to use InfiniteInsight Explorer to calculate some statistics) takes a few seconds.

Because the outcome of a random process is always a difficult beast to forecast, I have used a trick: instead of running KXEN’s InfiniteInsight on 12 different problems, I ran it on the same problem as if in 12 parallel universes. As part of building a model, InfiniteInsight randomly separates the input data into an Estimation set (for training the model) and a Validation set (for automatically testing the model). One can mimic the variability of InfiniteInsight’s results on a given problem by varying the random seed used internally by KXEN to separate the training data set into these two parts.
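To make the idea of a seed-dependent split concrete, here is a minimal sketch using Python's standard library. It is not InfiniteInsight's internal procedure, which this post does not describe: the function name `split_estimation_validation` and the 25% validation fraction are assumptions chosen for illustration only.

```python
import random


def split_estimation_validation(rows, seed, validation_fraction=0.25):
    """Randomly split rows into (estimation, validation) lists.

    The same seed always gives the same split; a different seed
    generally gives a different one. The 25% default is an assumption.
    """
    rng = random.Random(seed)  # private generator, so the split is reproducible
    shuffled = list(rows)      # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


rows = list(range(20))  # stand-in for the lines of the training data set
for seed in (2000, 2001, 2002):
    estimation, validation = split_estimation_validation(rows, seed)
    print(seed, sorted(validation))
```

Each seed plays the role of one "parallel universe": the model sees the same data, but a different part of it is held back for validation, so the resulting model and its profit vary from run to run.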

The results are presented below. Table 1 shows the results of the 21 best data scientist teams that participated in this challenge (out of 37 teams overall that submitted results), and Table 2 presents 12 runs of InfiniteInsight™ V6.0 using different seeds.

KDDCup Participant Results:

| Participants | Sum of Actual Profits | Number Mailed | Average Profits |
|---|---|---|---|
| GainSmarts | $14,712.24 | 56,330 | 0.26 |
| SAS/Enterprise Miner | $14,662.43 | 55,838 | 0.26 |
| Quadstone/Decisionhouse | $13,954.47 | 57,836 | 0.24 |
| ARIAI/CARRL | $13,824.77 | 55,650 | 0.25 |
| Amdocs/KDD Suite | $13,794.24 | 51,906 | 0.27 |
| # 6 | $13,598.05 | 55,830 | 0.24 |
| # 7 | $13,040.46 | 60,901 | 0.21 |
| # 8 | $12,298.23 | 48,304 | 0.25 |
| # 9 | $11,422.77 | 56,144 | 0.20 |
| # 10 | $11,276.46 | 90,976 | 0.12 |
| # 11 | $10,719.88 | 62,432 | 0.17 |
| # 12 | $10,706.34 | 65,286 | 0.16 |
| # 13 | $10,112.08 | 64,044 | 0.16 |
| # 14 | $10,048.72 | 76,994 | 0.13 |
| # 15 | $9,740.72 | 54,195 | 0.18 |
| # 16 | $9,463.77 | 79,294 | 0.12 |
| # 17 | $5,682.91 | 51,477 | 0.11 |
| # 18 | $5,483.67 | 30,539 | 0.18 |
| # 19 | $1,924.69 | 50,475 | 0.04 |
| # 20 | $1,706.17 | 42,270 | 0.04 |
| # 21 | $(53.68) | 1,551 | -0.03 |

KXEN’s InfiniteInsight Results:

| Seed | Sum of Actual Profits | Number Mailed | Average Profits |
|---|---|---|---|
| 2007 | 15,379.70 | 50,670 | 0.30 |
| 2010 | 15,374.70 | 60,437 | 0.25 |
| 2004 | 15,136.70 | 52,864 | 0.29 |
| 2002 | 15,004.10 | 60,312 | 0.25 |
| 2001 | 14,834.00 | 45,179 | 0.33 |
| 2009 | 14,746.30 | 49,637 | 0.30 |
| 2011 | 14,612.10 | 38,984 | 0.37 |
| 2008 | 14,540.80 | 51,172 | 0.28 |
| 2005 | 14,489.80 | 44,508 | 0.33 |
| 2003 | 14,483.30 | 53,908 | 0.27 |
| 2006 | 14,420.90 | 43,638 | 0.33 |
| 2000 | 14,400.90 | 46,949 | 0.31 |

There are several findings that can be extracted from these tables.

A given InfiniteInsight run (using a random seed) may not be the best, but it is not far off! Even our worst run would rank 3rd among the participants. That is OK: we have never claimed to always be the best (that would just be vanity), but to be among the best almost every time, and without tuning. Bear in mind that while the teams of data scientists in Table 1 probably each spent weeks on their results, each InfiniteInsight run took just 5 minutes!

All the winners used advanced techniques like ensemble models, two-stage models and so on. We, however, simply used InfiniteInsight Modeler’s regression module out of the box, and that is it! Averaged across the 12 runs, a company using KXEN’s InfiniteInsight would have made almost $2,000 more on this problem than the 12 best data scientist teams that participated in this challenge made on average.
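For readers who want to check the arithmetic, the short Python sketch below recomputes the comparison from the "Sum of Actual Profits" columns of the two tables. The variable names are mine; the figures are copied from the tables above.

```python
# Sum of Actual Profits for the 12 best participant entries (first table).
top12_participants = [
    14712.24, 14662.43, 13954.47, 13824.77, 13794.24, 13598.05,
    13040.46, 12298.23, 11422.77, 11276.46, 10719.88, 10706.34,
]

# Sum of Actual Profits for the 12 InfiniteInsight runs (second table).
kxen_runs = [
    15379.70, 15374.70, 15136.70, 15004.10, 14834.00, 14746.30,
    14612.10, 14540.80, 14489.80, 14483.30, 14420.90, 14400.90,
]

mean_participants = sum(top12_participants) / len(top12_participants)
mean_kxen = sum(kxen_runs) / len(kxen_runs)
gap = mean_kxen - mean_participants

# Rank the worst run would get if it were inserted into the leaderboard.
worst_run = min(kxen_runs)
rank_of_worst = 1 + sum(1 for profit in top12_participants if profit > worst_run)

print(f"mean of 12 best teams: ${mean_participants:,.2f}")
print(f"mean of 12 runs:       ${mean_kxen:,.2f}")
print(f"difference:            ${gap:,.2f}")
print(f"rank of worst run:     {rank_of_worst}")
```

The difference comes out at roughly $1,950, which is the "almost $2,000" quoted above, and the worst run places 3rd, behind only the top two participant entries.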

Some data scientists did worse than random (the 13th entry and below). Some data scientists even lost money on this challenge (the 21st entry). InfiniteInsight came out ahead with every run.

I know that this is just a single data point, and that one point is not enough to draw a general rule, but I will ask you to trust me (and the hundreds of KXEN customers) on this: the performance we see here is very similar to what we find in real-life problems!

The conclusion is this: there are good data scientists, and there are less good data scientists. I’m not saying that the teams that did worse than random in this challenge are not good; they may have just been unlucky. But if there is a dramatic shortage of data scientists, as is generally acknowledged, you can imagine that good data scientists are even harder to find!

KXEN’s InfiniteInsight automates a lot of the work that a good data scientist does, and so the results you get from it are generally equal to or better than those of the best data scientists. InfiniteInsight can turn just about anyone into a good data scientist, and it can make a good data scientist into a great data scientist.

One thought on “How InfiniteInsight Can Make You A Great Data Scientist”

  1. Very clear message. That’s why I like KXEN. As business consultants, we have more time to focus on business issues with the help of KXEN InfiniteInsight; we don’t need to spend our valuable time clicking and re-running multiple times to see the complex results.
