A whole range of statistical methods, both traditional and those found in machine learning, assume independence among the independent variables. That assumption is pretty important when interpreting the contribution of each variable to the dependent variable (which we call Y).

To unpack:
We say there is a high degree of collinearity between X1 and X2 if X1 is highly correlated with X2. It doesn’t matter if X1 causes X2. Or if X2 causes X1. The fact would remain that a change in X1 would come with a predictable change in X2. And, that a change in X2 would come with a predictable change in X1.

A Concrete Example

Assume:

  • X1 is the number of people walking past a patio on Peter Street.
  • X2 is the number of people who are sitting on the patio, drinking a beer, on Peter Street.
  • Y is beer sales for that pub operating the patio on Peter Street.

Assume a dataset, fit a traditional linear regression, and get the equation (illustrative only! Don’t go quoting it. It isn’t real, and I have no idea of the relative foot traffic on Peter Street):

Y = 1250 + 0.05 * X1 + 18.22 * X2

And, let’s assume that the model is predictive.

A marketer could predict revenue on any given future date just by estimating X1 and X2.
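
If you want to see the setup in code, here is a rough sketch in Python – simulated data only. The specific foot-traffic numbers, the noise levels, and the choice of statsmodels are all assumptions made for illustration, arranged so the fitted coefficients land near the made-up equation above.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

x1 = rng.normal(2000, 300, n)                              # foot traffic past the patio
x2 = 0.02 * x1 + rng.normal(0, 3, n)                       # patio sitters, mostly driven by x1
y = 1250 + 0.05 * x1 + 18.22 * x2 + rng.normal(0, 50, n)   # beer sales

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()
print(model.params)      # roughly [1250, 0.05, 18.22] on this simulated data
print(model.rsquared)    # high -- the model predicts well

# The marketer's forecast for a day with an estimated 2,500 passers-by
# and 55 people on the patio:
print(model.predict([[1, 2500, 55]]))

Note that x2 is built almost entirely out of x1 – which is exactly the trouble we’re about to run into.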

How much of that revenue should the marketer of that bar attribute to the location of the bar (placement) as opposed to the choice to invest in a patio? Going by the coefficients, clearly the patio is far more important!

Well, hold on – doesn’t X1 logically lead to X2? Doesn’t foot traffic cause people to sit on the patio?

You might see a causal link between X1 and X2. You might even see a reinforcing loop – the expectation of seeing people on the patio causes more foot traffic. And yet, the regression algorithm doesn’t see it.

It doesn’t.

That math simply can’t separate out the contribution of X1 from the contribution of X2 in a reliable way. Worse, the apparent predictive value of the model generally inflates even as those individual contributions become unreliable.
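
Here is a sketch of that unreliability, using the same simulated setup but with X2 tracking X1 almost perfectly, and refitting the same regression on bootstrap resamples of the data (again, all numbers are illustrative assumptions):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(2000, 300, n)
x2 = 0.02 * x1 + rng.normal(0, 0.5, n)           # now tracking x1 almost perfectly
y = 1250 + 0.05 * x1 + 18.22 * x2 + rng.normal(0, 50, n)

for _ in range(5):
    idx = rng.integers(0, n, n)                  # bootstrap resample of the same data
    X = sm.add_constant(np.column_stack([x1[idx], x2[idx]]))
    fit = sm.OLS(y[idx], X).fit()
    print(f"b1={fit.params[1]:7.3f}   b2={fit.params[2]:7.2f}   R^2={fit.rsquared:.3f}")

# b1 and b2 bounce around from resample to resample (b1 can even flip sign),
# while R^2 barely moves.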

(R^2 inflates wildly the more of these relationships creep in. The Variance Inflation Factor is a test for exactly this kind of collinearity among independent variables – but it isn’t baked into the standard regression output.)
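
If you want to run that check yourself, statsmodels ships a variance_inflation_factor function. A minimal sketch on the same simulated X1 and X2 (illustrative only):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(2000, 300, n)
x2 = 0.02 * x1 + rng.normal(0, 0.5, n)           # highly collinear with x1

X = sm.add_constant(np.column_stack([x1, x2]))
for i, name in [(1, "X1"), (2, "X2")]:
    print(name, variance_inflation_factor(X, i))

# A common rule of thumb treats VIF above roughly 5-10 as a warning sign;
# X1 and X2 here land far above that.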

The independent variable assumption doesn’t cause too many problems in the laboratory. We can control for enough variables to contain collinearity issues. The independent variable assumption causes very severe problems in the social sciences – and, in particular, marketing.

When the entire world is your petri dish, you have to assume that a lot of dirt is going to get into it. So the relationship between X1 and X2 – and the act of labeling them both as independent variables – can cause very real problems when it comes to something marketers currently care about. Like attribution.

Tomorrow, I’ll talk about the variable that you’re anticipating, W1, which we will call weather.

***

I’m Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca