How digital analysts can manage spurious relationships

Posted on March 29, 2012, edited on August 7, 2021, by Christopher Berry

A spurious relationship is one where it appears that X causes Y but, in reality, some other variable, W, causes both X and Y or, alternatively, X causes W, which in turn causes Y.

Variable W is lurking out there, hidden, messing up your understanding.
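Here is a minimal sketch of that lurking variable in action. It is a simulation with made-up numbers, not data from any real site: W drives both X and Y, X never touches Y, and the helper names (pearson, partial_corr) are mine. The raw correlation between X and Y looks healthy, and it mostly vanishes once W is controlled for.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def partial_corr(x, y, w):
    """Correlation of x and y after removing the linear effect of w."""
    r_xy, r_xw, r_yw = pearson(x, y), pearson(x, w), pearson(y, w)
    return (r_xy - r_xw * r_yw) / math.sqrt((1 - r_xw ** 2) * (1 - r_yw ** 2))

# Assumed setup: W causes X and W causes Y. X has no effect on Y at all.
rng = random.Random(42)
n = 5000
w = [rng.gauss(0, 1) for _ in range(n)]
x = [wi + rng.gauss(0, 1) for wi in w]
y = [wi + rng.gauss(0, 1) for wi in w]

raw = pearson(x, y)
controlled = partial_corr(x, y, w)
print(f"corr(X, Y) ignoring W:    {raw:.2f}")
print(f"corr(X, Y) controlling W: {controlled:.2f}")
```

The catch, of course, is that you can only control for a W you have thought of and measured.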

And sometimes Y is actually causing X. In that instance, it’s the modeler who has made a specification error: they picked the wrong variable as the dependent variable.

This gives rise to the meme “correlation isn’t causation!”

iPad and Conversion

A curious fact emerged a few days ago. Somebody noted how the conversion rate from their iPad catalog was double the website average. The writer didn’t outright claim that the iPad device caused conversion to double. Rather, they expressed that this was curious.

The model is:

X1 (iPad app attribution channel) and X2 (all other OS attributes, minus iPad) are correlated with Y (conversion).

Time for skepticism!

What is it about iPad users that is different, though?

If you reached into some recent market research, you’d find that the average household income (HHI) of iPad owners is falling. The iPad isn’t causing HHI to decrease. Rather, the iPad is getting adopted by a greater portion of the population, causing that average to decline. Owners of the iPad, and of tablets in general, are moving closer to the median. For some categories of products, HHI is indeed correlated with purchase. That wasn’t likely in this case though. So, let’s rule out HHI as a lurking variable.

A commenter on another forum pointed out that the key attribute of these users was preference for the brand. Specifically, a user had to go out of their way to search for a niche catalog app and then install it. That’s a lot of effort. There was no direct, clickable path: the user had to seek out that app and install it on their own.

They had to be aware of the brand.

They had to have a preference for that brand.

They had to have an iPad.

That sounds like a pretty good hypothesis to test on the next round. If you were to run a good old RFM (recency, frequency, monetary value) analysis on the customer database, what percentage of the iPad buyers are past customers? How many of them jumped from the web to the iPad? Are iPad users buying more, consistently? What percentage are new customers?
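To make the hypothesis concrete, here is a sketch of the breakdown I’d want to see. Every number below is invented purely for illustration and the channel and segment labels are hypothetical. In this toy data the iPad converts at roughly double the web rate overall, yet within each customer type the two channels convert identically. The whole gap comes from the mix of new versus returning customers.

```python
# Hypothetical counts, invented for illustration only:
# (channel, customer_type) -> (sessions, conversions)
counts = {
    ("web", "new"): (900, 18),
    ("web", "returning"): (100, 10),
    ("ipad", "new"): (50, 1),
    ("ipad", "returning"): (50, 5),
}

def segment_rate(channel, customer_type):
    """Conversion rate for one channel within one customer type."""
    sessions, conversions = counts[(channel, customer_type)]
    return conversions / sessions

def overall_rate(channel):
    """Conversion rate for a channel, pooling all customer types."""
    rows = [v for (ch, _), v in counts.items() if ch == channel]
    return sum(c for _, c in rows) / sum(s for s, _ in rows)

def returning_share(channel):
    """Share of a channel's sessions that come from returning customers."""
    total = sum(s for (ch, _), (s, _) in counts.items() if ch == channel)
    return counts[(channel, "returning")][0] / total

for channel in ("web", "ipad"):
    print(
        f"{channel:5s} overall={overall_rate(channel):.3f} "
        f"new={segment_rate(channel, 'new'):.3f} "
        f"returning={segment_rate(channel, 'returning'):.3f} "
        f"returning_share={returning_share(channel):.2f}"
    )
```

If the real breakdown looked anything like this, the device would be getting credit that belongs to brand preference.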

I’d hypothesize that an iPad app is unlikely to be a particularly good vehicle for new customer acquisition, so that percentage would be pretty low.

I’d also argue that portions of the iPad experience could be optimized to increase cross-sell and up-sell opportunities.

Without that first observation relating channel to performance, none of this knowledge would have been gained. Without understanding the relationship between X1…Xn and Y, the hunt for W1…Wn wouldn’t happen.

Fear and Loathing

A lot of people fear and hate spurious relationships, to the point where I think some use them as an excuse not to ask additional questions. Spurious relationships can produce bad insights and bad behavior. They might also cause good behavior.

There are a few concrete steps a digital analyst can take to mitigate those risks:

Have a theory / model that makes sense before running the regression

If you can control for confounding factors, control for them

If you can measure it, you can take steps to manage it. Pay attention to your error terms and be on the lookout for highly collinear factors.
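As one rough way to watch for collinearity, here is a sketch of the variance inflation factor (VIF) for the simplest case of two predictors, where it reduces to 1 / (1 - r^2) with r being the correlation between them. The data is simulated and the function names are my own. With more than two predictors you would regress each one on all the others instead, and a stats package will do that for you.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(x1, x2):
    """VIF of either predictor in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Simulated predictors (assumed, not real data).
rng = random.Random(7)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2_collinear = [v + 0.2 * rng.gauss(0, 1) for v in x1]  # nearly a copy of x1
x2_independent = [rng.gauss(0, 1) for _ in range(n)]    # unrelated to x1

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_independent)
print(f"VIF with a near-duplicate predictor: {vif_high:.1f}")
print(f"VIF with an independent predictor:   {vif_low:.2f}")
```

Common rules of thumb treat a VIF above 5 or 10 as a warning sign, but those cutoffs are conventions, not laws.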

Above all else, you should continue asking questions and continue being skeptical.

Nobody has everything figured out yet.

***

I’m Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca