SPSS, R, and Python (via matplotlib) all offer capable visualization tools because seeing the data is vital, even when you are armed with statistical methods.
The chart below, called Anscombe’s Quartet, illustrates why:
All four data sets return the same summary statistics:
- The mean of x is 9 in every set, and the mean of y is about 7.5.
- The correlation between x and y is about 0.816 in every set.
- Each set is described by the same best-fit linear regression equation, y = 3 + 0.5x.
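The matching statistics listed above can be checked directly. The following is a minimal sketch in plain Python using the published Anscombe (1973) values. The names `QUARTET`, `summarize`, and `SUMMARIES` are illustrative choices, not part of any library.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the means, Pearson correlation, and least-squares line for (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, s in SUMMARIES.items():
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
          f"r={s['r']:.3f} fit: y = {s['intercept']:.2f} + {s['slope']:.3f}x")
```

Each printed row should agree with the others to about two decimal places, even though the four scatter plots look nothing alike.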
It’s important to visualize the data, even when relatively powerful summary statistics are available, because:
- Outliers are common in most data, deserve special attention, and can heavily skew summary statistics.
- You may need something a bit heavier than linear regression to model the relationship between x and y.
- Summary statistics sacrifice specificity for simplicity, and as such, are not substitutes for understanding.
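To make the outlier point concrete, here is a small sketch that fits a least-squares line to set III of Anscombe's Quartet twice: once with all eleven points and once with the single outlying point (13, 12.74) removed. The helper `fit_line` and the variable names are hypothetical, chosen only for this illustration.

```python
from math import sqrt

# Set III of Anscombe's quartet: ten points lie almost exactly on one line,
# and a single outlier at x = 13 pulls the fitted line away from them.
X3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(x, y):
    """Return (slope, intercept, r) for the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / sqrt(sxx * syy)


# Fit with every point, then again with the outlier dropped.
with_outlier = fit_line(X3, Y3)
kept = [(xi, yi) for xi, yi in zip(X3, Y3) if (xi, yi) != (13, 12.74)]
without_outlier = fit_line([p[0] for p in kept], [p[1] for p in kept])

for label, (slope, intercept, r) in [("all 11 points", with_outlier),
                                     ("outlier removed", without_outlier)]:
    print(f"{label:>16}: y = {intercept:.2f} + {slope:.3f}x, r = {r:.4f}")
```

Removing one point out of eleven changes the slope noticeably and pushes the correlation close to 1, which is the kind of effect a scatter plot reveals at a glance and a summary table hides.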