A model is an abstraction of reality. It is a deliberate simplification of the world so that some part of it can be understood and ultimately predicted. A good model succeeds on two fronts:

  • It successfully communicates which variables cause other variables.
  • It successfully predicts the future with enough accuracy to be useful.

This all requires unpacking.

An abstraction is a deliberate simplification. If I were to ask, “what causes a business to be successful?” or “what causes car accidents?”, there would be multiple replies. With respect to a car accident, people might reply “bad drivers”. A policy expert might say, “driver distraction, impaired driving, driving on a suspended license, speed, congestion, road engineering, signal engineering, compliance with posted signs and laws”. These variables are salient. (At least, in my view.) There may be hundreds of additional variables that are truly salient. Driver experience, driver sight acuity, brake maintenance and weather conditions are four that I left out. But they could be salient, too.

Models have a dependent variable. It’s what you’re trying to predict. It’s the reason for the model. “Accidents” and “Profit” are two such dependent variables.

Models have multiple independent variables. These are the variables you believe drive the dependent variable.

How these variables are organized, relative to one another, is modeling.

Causality

The common phrase is that correlation is not causality. So what does that mean? It means that if I run a statistical test, and get a btau (Kendall’s tau-b, a rank correlation) of 0.900 between two variables, that observation of correlation does not mean there’s a causal link. The statistical method, btau, only measures how closely two variables move together. It can’t possibly have the context of causality.
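To make the point concrete, here is a minimal sketch, assuming “btau” refers to Kendall’s tau-b rank correlation. The function name and the two data series are invented for illustration. Note that the statistic is symmetric: swapping the two variables returns the same number, so the number alone cannot say which variable, if either, is the cause.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b for two equal-length sequences of numbers."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: drops out of the formula
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # the pair moves in the same direction
        else:
            discordant += 1  # the pair moves in opposite directions
    untied = concordant + discordant
    denominator = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denominator == 0:
        return float("nan")  # one variable is constant
    return (concordant - discordant) / denominator


# Hypothetical observations, invented for this example.
variable_a = [1, 2, 3, 4, 5, 6]
variable_b = [2, 1, 3, 4, 5, 6]

print(kendall_tau_b(variable_a, variable_b))
print(kendall_tau_b(variable_b, variable_a))  # same value: no direction implied
```

The direction of the arrow between the two variables has to come from the model, which is exactly the context the statistic lacks.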

The model contains that context. Not the statistical test. A model is an assertion of causality. When presenting a model, you are saying “this variable happens as a consequence of this other variable.”

The statistical test informs the predictiveness of the model.

Accurate enough to inform a decision

This part is particularly important.

A model is an abstraction because reality is complex. If you boil out the stuff that doesn’t matter, you’re left with the variables that have a large impact on the dependent variable. A totally complete model is as complex as reality. But it isn’t reality. Worse, because your understanding of reality will always be incomplete, chasing completeness risks damaging the predictiveness of the model.

Say you have 5 independent variables and one dependent variable. The dependent is “likelihood a person will cause an automotive accident today”. The independent variables are “number of hours a driver is driving”, “time of day driver is driving”, “driver has a suspended license”, “age”, and “road condition”. You can run statistical methods to assess the strength of the relationship between each pair of independent variables, and between each independent variable and the dependent variable.
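As a rough sketch of what “running statistical methods” on those variables could look like, the snippet below computes a correlation for every pair of columns. It uses the Pearson correlation as a stand-in (any measure of association could be swapped in), and the eight rows of driver data, the column names, and the 0/1 encodings are all invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical data: one entry per driver, in the same order in every column.
drivers = {
    "hours_driving":   [9, 0.5, 3, 6, 1, 8, 2, 4],
    "driving_at_night": [1, 0, 0, 1, 0, 1, 0, 1],   # 1 = yes, 0 = no
    "suspended":        [1, 0, 0, 0, 0, 1, 0, 0],   # 1 = suspended license
    "age":              [19, 54, 35, 22, 61, 27, 45, 30],
    "bad_road":         [1, 0, 1, 0, 0, 1, 0, 1],   # 1 = poor road condition
    "caused_accident":  [1, 0, 0, 1, 0, 1, 0, 0],   # the dependent variable
}

# Strength of the relationship between every pair of variables.
pairwise = {
    (a, b): pearson(drivers[a], drivers[b])
    for a, b in combinations(drivers, 2)
}

for (a, b), r in sorted(pairwise.items(), key=lambda item: -abs(item[1])):
    print(f"{a:>17} vs {b:<17} r = {r:+.2f}")
```

Pairs that include “caused_accident” hint at how predictive each independent variable might be. Pairs of independent variables that are strongly related to each other hint that one of them may be redundant.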

A good story to tell from that model might be:

“If I know how many hours somebody is driving, the times of day they’re on the road, if that driver has a suspended license, their age, and the condition of the roads – I can predict, with 70% accuracy, whether or not that person will cause an accident. For instance, a 19 year old driver on the road at night, driving for 9 hours that day, during a storm, with a suspended license – is 940 times more likely to cause an accident than a 54 year old driver, on the road during the afternoon, driving for 30 minutes, with clear road conditions and no suspended license.”

That prediction might sound obvious, but there’s a quantification there. The model expresses how much more likely something is to happen.
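The arithmetic behind a “times more likely” statement is just a ratio of two predicted probabilities. The sketch below assumes a model has already produced a probability for each driver. Both probabilities and the function name are invented here, chosen only so that the ratio matches the figure in the story above.

```python
def times_more_likely(p_higher, p_lower):
    """Ratio of two probabilities: how many times more likely the first is."""
    if not (0 <= p_higher <= 1 and 0 < p_lower <= 1):
        raise ValueError("probabilities must be in [0, 1] and p_lower must be > 0")
    return p_higher / p_lower


# Hypothetical model outputs, not real accident statistics.
p_19_year_old = 0.094   # night, 9 hours, storm, suspended license
p_54_year_old = 0.0001  # afternoon, 30 minutes, clear roads, valid license

ratio = times_more_likely(p_19_year_old, p_54_year_old)
print(f"{ratio:.0f} times more likely")
```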

Also note the restraint. There might be another model, selecting 7 different variables, that produces a more predictive result. There might exist a model containing 120 variables that is 79% accurate. That’s fine. A model containing 120 variables may be actionable by some decision makers. It won’t be useful for most people.

Restraint

In the next section, we’ll explore salient model making with a deliberate eye for predictability and actionability. I will also hammer the point home about restraint.