You may have read something about ‘Detecting Novel Associations in Large Data Sets’, a paper by David N. Reshef et al. that appeared in Science, 334, 1518 (2011). You can check out the software here.

This is an initial commentary and an explanation of what it’s all about.

The Longer You Look, The More Likely Error Will Find You

Take a very large dataset, say, all the customers of AT&T and their calling records from 2001 to 2011, and divide it at random into two equal-sized sets, Dataset A and Dataset B. Say you didn’t have any hypothesis at all. You just wanted to see which features in that set were related to each other. Say each customer record has 5000 features, including gender, date of birth, credit score, average call duration, most frequently dialed number, and so on. (Note to statisticians: assume a Pearson R correlation matrix and skip the next paragraph.)

Assume, further, that you’re going to compare each feature against every other feature. So you compare all the ages against all the dates of birth, then all the ages against all the credit scores, and so on. The strength of the relationship between any two features is expressed by a single number. The higher that number is, the stronger the relationship between the two. For instance, we might find that credit score and age are tightly correlated: the older someone is, the more likely their credit score is to be positive.

You’re likely to find clearly incorrect relationships in such a large table, just by accident. You might find in Dataset A, for instance, that there’s a statistically significant relationship between being a Virgo and having a negative credit score. There might be a relationship between average call duration and being a Capricorn. You know that such a result doesn’t make sense. Why would zodiac sign (derived from date of birth) affect those things? The way that chance works in such large tables is that the more pairs of features you test for significance, the more likely it is that you’ll find a relationship that doesn’t in fact hold in the real world. With 5000 features there are about 12.5 million pairs to test, so at the usual 5% significance level you would expect roughly 625,000 ‘significant’ results even if no two features were truly related.

In fact, most of those relationships would disappear in Dataset B. However, new, clearly untrue relationships would appear in Dataset B that don’t exist in Dataset A. When you’re dealing with thousands of features, the likelihood of such phenomena increases. And that’s with everything we know about probability holding true.
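
Here is a small sketch of that effect, scaled way down so it runs in a second or two. Everything in it is an assumption for illustration: the data is pure random noise rather than real customer records, there are 60 features instead of 5000, and the helper names (pearson_r, significant_pairs) are my own. The cutoff of 1.96 / sqrt(n) is only an approximation of the 5% significance threshold for a correlation.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def significant_pairs(rows, threshold):
    """Return the feature pairs (i, j) whose |r| exceeds the threshold."""
    n_features = len(rows[0])
    columns = [[row[k] for row in rows] for k in range(n_features)]
    return {
        (i, j)
        for i in range(n_features)
        for j in range(i + 1, n_features)
        if abs(pearson_r(columns[i], columns[j])) > threshold
    }

rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows, n_features = 400, 60  # toy sizes, far smaller than 5000 features

# Pure noise: by construction, no feature is related to any other.
data = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
half = n_rows // 2
dataset_a, dataset_b = data[:half], data[half:]

# For 200 rows, |r| above roughly 1.96 / sqrt(200) is about p < 0.05.
threshold = 1.96 / math.sqrt(half)

hits_a = significant_pairs(dataset_a, threshold)
hits_b = significant_pairs(dataset_b, threshold)
n_pairs = n_features * (n_features - 1) // 2

print("pairs tested:", n_pairs)
print("'significant' in A:", len(hits_a))
print("'significant' in B:", len(hits_b))
print("'significant' in both:", len(hits_a & hits_b))
```

Every hit here is false by construction, and the last line shows how few of them survive in both halves. That is the practical value of holding back a second dataset.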

In sum, a big reason why you go into a dataset with a hypothesis is to reduce the risk of coming up with something that is wrong, and very unlikely to be repeatable in other datasets.

Linear, Cubic, Exponential, Parabolic, Elliptical

Not all relationships are straight lines. Indeed, especially in certain types of logistic regression, we can get very amazing, very beautiful and complex shapes separating one case from another. Diaper usage plotted against age is a roughly parabolic relationship. Think about it. You use a lot of them when you’re very young, and you go through a lot of them when you’re very old. You don’t need many of them in between. Linear regression wouldn’t perform very well in detecting that pattern.
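
To make that concrete, here is a minimal sketch with a made-up U-shaped curve. The numbers are not real diaper data: I’m assuming usage is simply (age - 45) squared over ages 0 to 90, chosen only because it is perfectly symmetric. The pearson_r helper is my own plain implementation.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

ages = list(range(0, 91))
# Hypothetical usage curve: high at both ends, lowest at age 45.
usage = [(age - 45) ** 2 for age in ages]

# Over the whole range the rising and falling halves cancel out.
r_all = pearson_r(ages, usage)

# Over the first half alone, usage falls steadily with age.
r_young = pearson_r(ages[:46], usage[:46])

print("Pearson r, ages 0-90:", round(r_all, 3))
print("Pearson r, ages 0-45:", round(r_young, 3))
```

Usage is completely determined by age in this toy curve, yet the correlation over the full range comes out at essentially zero. The relationship only shows up to Pearson R once you cut the curve in half.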

Enter Reshef et al. and MIC

MIC stands for Maximal Information Coefficient. Reshef et al. invented a neat way of looking at relationships between variables that doesn’t rely solely on a key statistical measure (Pearson R) to indicate that a relationship is there. The authors demonstrated how MIC manages to detect associations across all these complex relationship types (cubic, exponential, sinusoidal) and does it really well. They went further. They created a program that can mine very large datasets and suggest relationships to examine.
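
To give a feel for the idea, here is a deliberately crude sketch. It is not the authors’ algorithm and not their software. It lays grids of different sizes over the scatterplot, measures the mutual information between the grid’s rows and columns, normalizes it, and keeps the best score. The real method also searches over where to place the grid lines; I’m assuming equal-frequency bins only, so this version can only underestimate what a proper search would find. The function names are my own, and alpha=0.6 mirrors the grid-size limit of n to the power 0.6 described in the paper.

```python
import math
import random
from collections import Counter

def equal_frequency_bins(values, n_bins):
    """Give each value a bin index so bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def mutual_information(bx, by):
    """Mutual information (in nats) between two lists of bin indices."""
    n = len(bx)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log((c * n) / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )

def crude_mic(xs, ys, alpha=0.6):
    """Best normalized mutual information over equal-frequency grids.

    Simplified stand-in for MIC: grids are limited to n**alpha cells,
    but grid-line placement is NOT optimized as in the real method.
    """
    max_cells = len(xs) ** alpha
    best = 0.0
    nx = 2
    while nx * 2 <= max_cells:
        bx = equal_frequency_bins(xs, nx)
        ny = 2
        while nx * ny <= max_cells:
            by = equal_frequency_bins(ys, ny)
            score = mutual_information(bx, by) / math.log(min(nx, ny))
            best = max(best, score)
            ny += 1
        nx += 1
    return best

rng = random.Random(1)  # fixed seed so the run is repeatable
xs = [rng.uniform(-1, 1) for _ in range(500)]

score_linear = crude_mic(xs, list(xs))                   # y = x
score_parabola = crude_mic(xs, [x * x for x in xs])      # y = x squared
score_noise = crude_mic(xs, [rng.uniform(-1, 1) for _ in xs])

print("linear:", round(score_linear, 2))
print("parabola:", round(score_parabola, 2))
print("unrelated noise:", round(score_noise, 2))
```

The straight line and the parabola both score high, and the unrelated noise scores low. That is the behaviour that makes this family of measures interesting: the score responds to structure, not to straightness.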

What’s the Problem?

Remember the idea that the longer you look, the more likely you are to find something false? Hypothesis testing as the basis of quantitative analysis is an entrenched idea, and it causes resistance to advanced machine learning algorithms and pattern discovery. Reshef really did a great job in explaining the purpose of MIC. Reshef has merely stated that this is a hypothesis-informing machine. You can use the program and MIC to discover relationships that were once really quite hidden, or very, very difficult to discover without insanely expensive software. I think this is great.

The Opportunity

We’re generating huge amounts of data. The many-features, big-data problem is increasingly common. This is a great tool to rapidly inform hypotheses, to become smarter before getting smarter. It’s a welcome advancement, and worthy of attention.

If you hear of MIC, just know that a MIC of 0.00 means that there is no association between two variables, and that a MIC of 1.00 indicates a perfect association between them. Be aware that MIC does not imply linearity between the variables; the relationship may be a much higher-order function. The next questions you should ask upon hearing a MIC score are ‘at what confidence level is it significant?’ and ‘what kind of relationship is it?’. Then deep dive.
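
One generic way to put a significance figure on any association score is a permutation test: shuffle one variable so that any real link is destroyed, recompute the score many times, and see how often chance alone does as well as the real data. The sketch below is an assumption-laden illustration, not the procedure from the paper. The function names are mine, and I use the absolute Pearson R as a stand-in score to keep the block short; you would pass in a MIC function instead.

```python
import math
import random

def abs_pearson(xs, ys):
    """Stand-in association score: absolute Pearson correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return abs(sxy / math.sqrt(sxx * syy))

def permutation_p_value(xs, ys, score, n_permutations=500, seed=0):
    """Estimate how often shuffled data scores as high as the real data."""
    rng = random.Random(seed)
    observed = score(xs, ys)
    shuffled = list(ys)
    at_least_as_high = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between xs and ys
        if score(xs, shuffled) >= observed:
            at_least_as_high += 1
    # The +1 terms count the observed data itself, so p is never exactly 0.
    return (at_least_as_high + 1) / (n_permutations + 1)

rng = random.Random(42)  # fixed seed so the run is repeatable
xs = [rng.gauss(0, 1) for _ in range(100)]
related = [x + 0.5 * rng.gauss(0, 1) for x in xs]
unrelated = [rng.gauss(0, 1) for _ in xs]

p_related = permutation_p_value(xs, related, abs_pearson)
p_unrelated = permutation_p_value(xs, unrelated, abs_pearson)

print("p-value, related pair:", round(p_related, 3))
print("p-value, unrelated pair:", round(p_unrelated, 3))
```

Keep the earlier warning in mind when reading these p-values. If you run this over millions of feature pairs, a small p-value on any single pair means much less than it appears to.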

I’m excited.