Have you ever heard anybody use the sentence:

“The problem with that model is that it over fits the data.”

Ever wonder that that means?

The purpose of science is to use knowledge to make good predictions about the future. To do so, you use theories which inform models. Models are deliberate simplifications of the world which make explicit statements about the direction of the arrow of causality, and are judged to be useful only if the assumptions are actually good.

A good model makes accurate predictions about the future. That supposes that the assumptions which underpin the model are actual best-proxies for how nature actually works.

[Data scientists: If you have a problem with what I wrote here, leave a comment or email me.]

Models can be calibrated to over fit one data set, and in so doing, fail to make accurate predictions about the future.

There’s a convention, at least in Marketing Science, that you have actual empirical data to support your model. That is to say, you have a dataset, and in order to explain something in nature, you’re creating a piece of math whose purpose it is to replicate those results.

There’s a way that scientists break up a single data source into multiple sets. One part is used to tune a model, and another part is held ‘out-of-sample’ to validate the model. It’s a way of making a prediction about the model the results of the model before it goes out to be replicated by others, in other contexts.

Sometimes a model predicts a single data set extremely well. Say, a given model predicts the eBay auction price of an American League baseball card extremely well for the years 2003 through 2005, however, when given data for 2006 to 2011, it ceases to make accurate predictions.

When a model is too specific, too fine tuned to a given data set, we say that it is overfitting the data.

Overfitting technically makes a paper appear very good, you know, hooray, your model worked – high five, drinks time. However, it may not be of any actual use to anybody else. Overfitting can fool people into believing they actually found something useful.

That’s what overfitting means.


I’m Christopher Berry.
I tweet about analytics @cjpberry
I write at christopherberry.ca