Checking Regression Assumptions – Residuals

In simple linear regression we look at the relationship between two variables, and we use this relationship to make predictions.

A regression model fitted to a sample from a population uses the sample results to produce a best-fit line for the dataset.

However, if we were to take a new dataset from the same population, the same best-fit line may no longer hold.

How well does our model fit?

To build a simple linear regression model, we fit a regression line to the data.

However, two conditions must be met before we can apply our model to a dataset:

  1. The y values must have an approximately normal distribution for each value of x.
  2. The y values must have a constant amount of spread (standard deviation) for each value of x.

It is important to check that these conditions are truly met. In effect, we are trying to detect anomalies in our dataset, i.e. outliers.

Residuals – What are they?

Residuals allow us to see how far off our predictions are from the actual data.

To check whether the y values come from a normal distribution, we need to measure how far the predictions are from the actual data.

A residual, also referred to as an error, is the difference between the observed value and the value predicted by the best-fit line.

If the residual is large, the data point does not fit the line well and will appear some distance from it. If the residual is small, the point will appear close to the line.
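As a quick illustration (a minimal sketch with made-up numbers, not the dataset used below), we can compute residuals by hand and confirm they match R's residuals() function:

# Toy data for illustration only
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8)
m <- lm(y ~ x)
y - fitted(m)   # observed minus predicted, computed by hand
residuals(m)    # the same values from residuals()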

Let’s look at an example

Create a dataset

data1 <- c(48.5, 54.5, 61.25, 69, 74.5, 85, 89, 99, 112, 123, 134, 142)

data2 <- c(8, 9.44, 10.08, 11.81, 12.28, 13.61, 15.13, 15.47, 17.36, 18.07, 20.79, 16.06)

Plot the data to visualize the relationship

plot(data1, data2)

[Figure: scatter plot of data2 against data1]

Create and view a linear model

fit1 <- lm(data2 ~ data1)

fit1

Call:
lm(formula = data2 ~ data1)

Coefficients:
(Intercept)        data1
     3.6938       0.1134
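If we only need the coefficients, or want more detail such as standard errors and R-squared, we can query the model object directly:

coef(fit1)      # intercept and slope as a named vector
summary(fit1)   # standard errors, R-squared and a residual summary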

Let’s look at the correlation value

cor(data1, data2)

0.9264765

The correlation value between the two datasets shows a strong positive relationship.
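As a side note, in simple linear regression the squared correlation equals the model's R-squared, so the two measures should agree:

cor(data1, data2)^2        # squared correlation
summary(fit1)$r.squared    # R-squared from the model: the same value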

We will create a regression line to visualize this further.

abline(fit1)

[Figure: scatter plot with the fitted regression line added]

From the graph we can see that most data points fit closely to the line; however, a couple of data points sit at a greater distance from it. Possible outliers?

To investigate this further, we will now look at our residuals.

First, create residual values for our model fit1

residuals(fit1)

Next, make them easier to read by rounding the output to 2 significant figures

res <- signif(residuals(fit1), 2)

res

We can see from the output that the residuals range from -3.70 to 1.90.
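As a sanity check, these values should be exactly the observed values minus the fitted values from the line:

# Should return TRUE: residuals are observed minus fitted values
all.equal(res, signif(data2 - fitted(fit1), 2))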

To put the residual values into perspective, we will standardize them and visualize them using various methods.

fit1.stdRes <- rstandard(fit1)
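Under the hood, rstandard() divides each residual by an estimate of its standard deviation, s * sqrt(1 - h), where h is the observation's leverage. A quick sketch confirming this:

h <- hatvalues(fit1)       # leverage of each observation
s <- summary(fit1)$sigma   # residual standard error
all.equal(residuals(fit1) / (s * sqrt(1 - h)), fit1.stdRes)   # should be TRUE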

Using a normality plot

qqnorm(fit1.stdRes, ylab = "Standardized Residuals", xlab = "Normal Scores", main = "Normality Plot", col = "red")

qqline(fit1.stdRes)

[Figure: normality (Q-Q) plot of the standardized residuals]

We can see that most data points are close to the line; however, one in particular is quite far from it.
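As a complementary numeric check, base R's shapiro.test() can be applied to the standardized residuals; a small p-value would suggest the residuals are not normally distributed:

# Shapiro-Wilk normality test on the standardized residuals
shapiro.test(fit1.stdRes)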

Using an abline

plot(fit1.stdRes, col = "red")

abline(0, 0)

[Figure: standardized residuals by observation index, with a horizontal line at zero]

Using standardized residuals, we get a clear picture of how far each data point is from the line.
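A common rule of thumb is to flag any observation whose standardized residual exceeds 2 in absolute value; we can find such points directly:

# Indices of observations with |standardized residual| > 2
which(abs(fit1.stdRes) > 2)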

Using a histogram

hist(fit1.stdRes, col = "red")

[Figure: histogram of the standardized residuals]

Our histogram indicates that the residuals do not follow a normal distribution.
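It is worth noting that base R can also produce a standard set of diagnostic plots (residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage) straight from the model object:

# Built-in diagnostic plots for an lm object, arranged in a 2x2 grid
par(mfrow = c(2, 2))
plot(fit1)
par(mfrow = c(1, 1))   # restore the default single-plot layout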

Conclusion

In our initial observation of our data set, we suspected a possible outlier.

Using this example, we have concluded that there is a definite need to investigate one of the data points, as it does not fit well to our best-fit line.

Outliers can dramatically affect models; often they are simply removed from a dataset and deemed incorrect values. However, before removing an outlier, bear in mind that it might be a surprising piece of information, a golden nugget, that gives new insight into your analysis.
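To see how much influence the suspect point has, one possible follow-up (a sketch, not a recommendation to delete the point) is to refit the model without it and compare the coefficients:

# Refit without the most extreme point and compare coefficients
suspect <- which.max(abs(fit1.stdRes))   # index of the largest standardized residual
fit2 <- lm(data2[-suspect] ~ data1[-suspect])
coef(fit1)   # original coefficients
coef(fit2)   # coefficients with the suspect point removed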

 
