Most discussions of statistical bias, in the world of sampling, revolve around the actual randomness of the sampling. Is there a systematic bias in the way the sample is collected: in who is selected to participate, who opts to participate, and who chooses to answer specific questions?
It’s commonly argued that if the sample is biased, you have to throw away the whole data set, because the sample is not representative of the overall population. And, in general, we confine our discussions of bias to the nature of the sampling, or how summary statistics deviate from what is expected.
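To make the sampling point concrete, here’s a minimal sketch. All the numbers and the opt-in mechanism are hypothetical: suppose dissatisfied respondents are simply more likely to answer a survey. The sample mean drifts away from the population mean, even with a large sample.

```python
import random

random.seed(0)

# Hypothetical population: 10,000 people with satisfaction ratings 1-10.
population = [random.randint(1, 10) for _ in range(10_000)]

def biased_sample(pop, n):
    """Draw n responses where low raters (<= 4) opt in at double the rate."""
    sample = []
    while len(sample) < n:
        rating = random.choice(pop)
        weight = 2.0 if rating <= 4 else 1.0  # self-selection bias
        if random.random() < weight / 2.0:
            sample.append(rating)
    return sample

pop_mean = sum(population) / len(population)
samp = biased_sample(population, 2_000)
samp_mean = sum(samp) / len(samp)
print(f"population mean: {pop_mean:.2f}, biased sample mean: {samp_mean:.2f}")
```

No amount of extra sample size fixes this: the gap between the two means is systematic, not random.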
There’s another type of bias that revolves around inferring causality.
Statisticians generally don’t enjoy talking about causality. They make statements about how things are related. But they’re careful not to make statements about causality. To do so would invoke quite a few messy assumptions. The world of inferring causality is left to the modellers and the theoreticians.
There’s an idea called positivism. I suppose, somewhere along the line, statisticians just thought that their big contribution could be making assertions about the world and collecting facts about it. Kind of similar to how Victorians had this thing for collecting, tagging, and sorting insects.
Making inferences about causality introduces a far different type of bias. If my model makes accurate predictions about the future, then it’s a good model. Making accurate predictions is requisite for optimization. However, there exists a reality outside of my model, and there’s no guarantee of actual causality. My model may generate great predictions, but for the wrong reasons. Horreur! Bias!
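A toy sketch of predicting well for the wrong reasons (the variables and coefficients here are invented): a regression of drowning incidents on ice cream sales fits beautifully, because a confounder, temperature, drives both. The predictions are fine; the causal story is not.

```python
import random

random.seed(1)

# Hypothetical confounder: temperature drives both ice cream sales and
# drowning incidents; neither causes the other.
temps = [random.uniform(10, 35) for _ in range(500)]
ice_cream = [2.0 * t + random.gauss(0, 3) for t in temps]
drownings = [0.5 * t + random.gauss(0, 1) for t in temps]

def fit_line(x, y):
    """Ordinary least squares slope and intercept for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    slope = cov / var
    return slope, my - slope * mx

slope, intercept = fit_line(ice_cream, drownings)
preds = [slope * x + intercept for x in ice_cream]

# R^2: how much variance the non-causal predictor "explains"
ss_res = sum((y - p) ** 2 for y, p in zip(drownings, preds))
mean_y = sum(drownings) / len(drownings)
ss_tot = sum((y - mean_y) ** 2 for y in drownings)
r2 = 1 - ss_res / ss_tot
print(f"R^2 of drownings ~ ice cream sales: {r2:.2f}")
```

Banning ice cream would not save a single swimmer, yet the model would happily keep forecasting.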
These kinds of metaphysical arguments are troubling for many. It’s a kind of recursive argument, and recursive arguments are generally hard all by themselves. A model cannot, by definition, bloat to the point of becoming reality itself. So the entire course of science, then, is creative destruction. There’s this idea that the degree of truth is variable over time: I can be right today, and I can be more right tomorrow.
Those who subscribe to the machine learning perspective don’t care. So long as the machine generates excellent predictions, who cares about the feelings of a statistician? Or their knowledge? Indeed, they can rightfully argue that this navel-gazing, conservative positivism has held statistics, and humanity, back.
The most troubling part about bias is that many understand that it’s a powerful sword, but don’t actually know how to use it. Simple statements about reality, instrumented well, cause intense controversy within digital analytics. We’re not even talking about causal models here. We’re not even close.
I’m Christopher Berry.
I’m part of the team building Authintic.