Why seeing the distribution of data is important

Posted on January 16, 2012. Edited on August 7, 2021 by Christopher Berry

SPSS, R, and Python (matplotlib) have very functional visualization libraries because seeing the data is vital, even when armed with statistical methods.

The chart below, called Anscombe’s Quartet, illustrates why:

All four data sets return the same summary statistics:

The mean of x is 9 in every set (and the mean of y is about 7.5 in every set).

The correlation between x and y is 0.816 in every set.

They can all be described by the same best-fit linear regression equation, y = 3 + 0.5x.
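The statistics above can be checked by hand. Here is a minimal sketch in plain Python, assuming the data values from Anscombe's 1973 paper (they are not reproduced in this post); the names quartet and summarize are just illustrative.

```python
from math import sqrt

# Anscombe's quartet, values as published by Anscombe (1973).
# Sets I, II and III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return (mean_x, mean_y, Pearson r, slope, intercept) for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / sqrt(sxx * syy)             # Pearson correlation
    slope = sxy / sxx                     # ordinary least-squares slope
    intercept = mean_y - slope * mean_x   # least-squares intercept
    return mean_x, mean_y, r, slope, intercept

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, (mx, my, r, b, a) in summaries.items():
    print(f"{name}: mean x = {mx:.2f}, mean y = {my:.2f}, "
          f"r = {r:.3f}, fit: y = {a:.2f} + {b:.3f}x")
```

The four printed lines agree to about two decimal places, which is exactly why the numbers alone can't tell the sets apart.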

It’s important to visualize the data, even when relatively powerful summary statistics are available, because:

Outliers are common in most data, deserve special attention, and can badly skew summary statistics.

You may need something a bit heavier than linear regression to model the relationship between x and y.

Summary statistics sacrifice specificity for simplicity, and as such, are not substitutes for understanding.
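To make the outlier point concrete, here is a rough sketch using the third set of the quartet (values again assumed from Anscombe's 1973 paper, with fit_line as an illustrative helper). It fits a line with and without the single point that sits far above the rest. This is only meant to show how much one point can move the fit, not to suggest that outliers should simply be deleted.

```python
from math import sqrt

# Set III of Anscombe's quartet, values as published by Anscombe (1973).
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept, Pearson r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / sqrt(sxx * syy)

# Fit using all eleven points.
full = fit_line(x3, y3)

# Fit again after dropping the one point far above the others (x = 13, y = 12.74).
kept = [(a, b) for a, b in zip(x3, y3) if (a, b) != (13, 12.74)]
x_kept, y_kept = zip(*kept)
trimmed = fit_line(x_kept, y_kept)

for label, (slope, intercept, r) in (("all points", full), ("without outlier", trimmed)):
    print(f"{label}: y = {intercept:.2f} + {slope:.3f}x, r = {r:.4f}")
```

With the outlier gone, the remaining ten points fall almost exactly on a line with a noticeably shallower slope, something a scatter plot shows at a glance.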

2 thoughts on “Why seeing the distribution of data is important”

Andy Lepki says:

January 16, 2012 at 1:08 pm

Good post Chris. I occasionally go into Excel with my XY points, add Trendline, and on the “Format Trendline” options, find some polynomial order which coincidentally hits my points… Science at its finest!

Christopher Berry says:

January 16, 2012 at 8:52 pm

Hi Andy,

I find introducing higher-order polynomials helpful for explaining variation.

I also find that introducing too many higher-order terms adds too much variance, so the model drastically overfits the data.

This is a double-edged sword. You’re caught between high bias and high variance.

Such is life, right?
