**INTRODUCTION**

So as the saying goes…

“A Picture Paints a Thousand Words”

But not for everyone…!

I was talking to a couple of business people recently about data and the insights that can be gained by using it to solve business problems. During our conversation I asked them what tools they use to visualize their data. “We don’t. We don’t like graphs etc., we like numbers,” was their response. Needless to say, being a student of data analytics, I was horrified, and before I knew it I had turned into my lecturer and started giving them a lesson on Anscombe’s Quartet.

So, going back to “a picture paints a thousand words”, the lesson here is that it is not immediately true for everybody. With the wisdom of Anscombe’s Quartet, however, it is possible to start changing people’s mind-set.

**Anscombe’s Quartet**

The statistician Francis Anscombe constructed the Anscombe dataset in 1973.

Anscombe created the dataset to demonstrate the importance of visualizing data and also to highlight the effect that outliers can have on the statistical findings of a dataset.

Anscombe’s Quartet consists of four datasets that have nearly identical statistical properties when examined, yet tell very different stories when graphed.

**Anscombe’s Dataset**

Each of the datasets in the quartet consists of 11 (x,y) points:

**Statistical Properties**

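Each dataset in the quartet has nearly the same summary statistics: a mean of x of 9, a variance of x of 11, a mean of y of about 7.50, a variance of y of about 4.12 to 4.13, a correlation between x and y of about 0.816, and a linear regression line of roughly y = 3.00 + 0.500x.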
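The shared statistics can be checked directly from Anscombe's published data points. The snippet below is a minimal sketch using only Python's standard library; the names `QUARTET`, `summarise` and `SUMMARY` are my own choices for illustration, not part of any library.

```python
from statistics import mean, variance

# Anscombe's quartet (1973). Datasets 1 to 3 share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Return the summary statistics the quartet is known for."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)                    # spread of x
    syy = sum((b - my) ** 2 for b in y)                    # spread of y
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))   # joint spread
    slope = sxy / sxx                                      # least-squares slope
    return {
        "mean_x": mx,
        "var_x": variance(x),          # sample variance (n - 1 denominator)
        "mean_y": my,
        "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,  # Pearson's correlation coefficient
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARY = {k: summarise(x, y) for k, (x, y) in QUARTET.items()}
for k, s in SUMMARY.items():
    print(k, {name: round(value, 2) for name, value in s.items()})
```

Rounded to two decimal places, the four printed rows should be almost indistinguishable, which is exactly why the numbers alone do not tell the whole story.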

**Pearson’s Correlation Coefficient**

In our examples we have used Pearson’s correlation coefficient, which measures the strength of the linear correlation between two quantitative variables, in our case x and y. Its value lies between -1 and +1, where +1 is a total positive linear relationship, 0 denotes no linear relationship and -1 is a total negative linear relationship.

Using this method, the closer the coefficient, r, is to +1 or -1, the stronger the association of the two variables.
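To make the +1, 0 and -1 cases concrete, here is a small sketch that computes r by hand as the joint spread of x and y divided by the product of their individual spreads. The function name `pearson_r` and the five-point samples are made up for illustration only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: joint spread of x and y over the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]                      # made-up values, for illustration only
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises in step with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls in step with x
r_hump = pearson_r(x, [1, 3, 5, 3, 1])   # symmetric hump: no linear trend

print(r_up, r_down, r_hump)
```

The first two values come out at the extremes of +1 and -1, while the third is essentially 0 even though y clearly depends on x. A coefficient near 0 only rules out a linear relationship, not a relationship altogether.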

With Pearson’s we need to err on the side of caution, as it assumes a normal distribution of the sample and requires outliers to be dealt with. Other methods may need to be applied where these assumptions are not met. This reinforces the need to visualise your data.

**Let’s Visualise the Data**

**Anscombe 1** – This graph shows a simple linear positive relationship. It is what we would expect to see, assuming a normal distribution.

**Anscombe 2** – This graph does not appear to be normally distributed. We can, however, see a relationship between the two variables: it appears to be quadratic or parabolic, but it is not linear.

**Anscombe 3** – This graph shows a clear outlier in the dataset. The data points, with the exception of the outlier, show what appears to be a perfect linear relationship, but because of the outlier the value of the correlation coefficient has been reduced from 1 to 0.816.

**Anscombe 4** – In this graph we can see that the value of x stays constant, with the exception of one outlier. This outlier has created the same correlation coefficient as the other datasets, which is a high correlation; however, the relationship between the two variables is not linear.
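The effect of the single outlier in datasets 3 and 4 can be checked by recomputing the correlation with that point left out. This is only a sketch: `pearson_r` and `without` are helper names I have made up, and I identify the outliers simply as the largest y in dataset 3 and the largest x in dataset 4.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r, or None when either variable has no spread."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # r is undefined: one variable never changes
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def without(values, index):
    """Copy of values with the item at index left out."""
    return values[:index] + values[index + 1:]


x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

i3 = y3.index(max(y3))  # dataset 3 outlier: the point (13, 12.74)
i4 = x4.index(max(x4))  # dataset 4 outlier: the point (19, 12.50)

r3_all = pearson_r(x3, y3)
r3_trimmed = pearson_r(without(x3, i3), without(y3, i3))
r4_all = pearson_r(x4, y4)
r4_trimmed = pearson_r(without(x4, i4), without(y4, i4))

print("dataset 3:", round(r3_all, 3), "->", round(r3_trimmed, 3))
print("dataset 4:", round(r4_all, 3), "->", r4_trimmed)
```

Without its outlier, dataset 3 is almost perfectly linear. Without its outlier, dataset 4 has no variation in x at all, so the correlation cannot even be computed: the whole coefficient rests on one point.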

**MY OWN DATASET**

In my wisdom, I first misunderstood “coming up with my own dataset” to mean creating a fifth dataset with the same statistical values as Anscombe’s.

I was relieved, however, to hear that it meant coming up with four sets of data points that, like Anscombe’s, share the same statistical outcomes with one another: mean, variance, correlation and linear regression line. This was a far simpler task and just meant using logic such as doubling each data point. That almost seemed too simple, so I also tried adding 1 to each data point in Anscombe’s quartet. Both ways worked, as follows. I might note that dataset 4 in both instances shows a correlation of 0.817, instead of 0.816, because of rounding, but the difference is not significant.
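The reason both tricks work is that Pearson's r does not change when every value is multiplied by the same positive number or shifted by the same amount. A minimal sketch of that check follows; `transform` and `RESULTS` are names I have made up for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def transform(values, scale=1, shift=0):
    """Apply the same linear change to every value."""
    return [scale * v + shift for v in values]


X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

RESULTS = {}
for k, (x, y) in QUARTET.items():
    original = pearson_r(x, y)
    doubled = pearson_r(transform(x, scale=2), transform(y, scale=2))
    plus_one = pearson_r(transform(x, shift=1), transform(y, shift=1))
    RESULTS[k] = (original, doubled, plus_one)
    print(k, round(original, 3), round(doubled, 3), round(plus_one, 3))
```

Two things are worth noting. Doubling multiplies the means by 2 and the variances by 4, and adding 1 moves the means by 1, so the new sets match one another rather than Anscombe's original values; only the correlation is carried over unchanged. Also, dataset 4's correlation is about 0.8165 before any change is made, so it already shows as 0.817 at three decimal places.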

**Anscombe’s Dataset Data Points + 1**

**TRYING MY OWN DATASET TO GIVE SAME STATISTICAL VALUES AS ANSCOMBE**

Here we go….

My plan was to use the equation of the line to generate the y points for given x values, which would give an exact positive linear relationship, i.e. a correlation of 1.

Then I would have to adjust some y values to get the correct y variance and aim for a correlation of 0.816.

**THAT’S WHERE IT ALL WENT PEAR SHAPED!!!!**

I spent way too much time on this. Like a dog with a bone, I tried many combinations.

Given more time I expect I would have finally got there… but how long it would have taken, we will never know!
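For what it is worth, the plan can probably be made to work without trial and error. The sketch below is not what I did; it is one possible approach, assuming the target line y = 3 + 0.5x and a target correlation of 0.816. It starts from the exact line, then adds "noise" that has been centred and stripped of any straight-line trend in x (so the fitted line stays put), scaled so the correlation lands on the target. The names `build_y` and `check`, and the `raw_noise` values, are hypothetical.

```python
from math import sqrt


def build_y(x, raw_noise, intercept=3.0, slope=0.5, target_r=0.816):
    """Build y values whose fitted line and correlation with x hit the targets.

    Assumes slope > 0 and that raw_noise is not itself a straight line in x.
    """
    n = len(x)
    mx = sum(x) / n
    xc = [a - mx for a in x]                 # centred x
    sxx = sum(a * a for a in xc)

    # Centre the noise, then remove any straight-line trend in x, so adding
    # it to the line changes neither the fitted slope nor the intercept.
    m_noise = sum(raw_noise) / n
    e = [v - m_noise for v in raw_noise]
    trend = sum(a * v for a, v in zip(xc, e)) / sxx
    e = [v - trend * a for a, v in zip(xc, e)]

    # r^2 = explained / (explained + residual); solve for the residual size.
    explained = slope ** 2 * sxx
    residual = explained * (1 / target_r ** 2 - 1)
    scale = sqrt(residual / sum(v * v for v in e))
    return [intercept + slope * a + scale * v for a, v in zip(x, e)]


def check(x, y):
    """Return (mean of y, sample variance of y, r, slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my, syy / (n - 1), sxy / sqrt(sxx * syy), slope, my - slope * mx


x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]         # Anscombe's x values
raw_noise = [1, -1, 2, -2, 0, 1, -1, 3, -3, 0, 1]  # arbitrary, made-up pattern
y = build_y(x, raw_noise)

mean_y, var_y, r, fitted_slope, fitted_intercept = check(x, y)
print([round(v, 2) for v in y])
print(round(mean_y, 2), round(var_y, 2), round(r, 3),
      round(fitted_slope, 3), round(fitted_intercept, 3))
```

Any other noise pattern should give a different-looking fifth dataset with the same headline statistics, as long as it is not a straight line in x.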