In simple linear regression we examine the relationship between two variables, and we use this relationship to make predictions.
A regression model fitted to a sample produces a best-fit line for that dataset.
However, if we were to take a new sample from the same population, that best-fit line might no longer hold.
How well does our model fit?
We create a regression line in order to build a simple linear regression model.
However, two conditions must be met before we can apply our model to a dataset:
- The y value must have an approximately normal distribution for each value of x.
- The y value must have a constant amount of spread (standard deviation) for each value of x.
It is important to check that these conditions are truly met. In effect we are trying to detect anomalies in our dataset, i.e. outliers.
Residuals – What are they?
Residuals allow us to see how far off our predictions are from the actual data that came in.
To check whether the y values come from a normal distribution, we need to measure how far the predictions are from the observed data.
A residual, also referred to as an error, is the difference between the observed value and the value predicted by the best-fit line.
If a residual is large, the point does not fit the line well and sits some distance from it. If a residual is small, the point sits close to the line.
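As a quick sketch of the definition, residuals can be computed by hand as observed minus predicted values. The numbers here are made up purely for illustration and are not from the dataset below:

```r
# A residual is the observed value minus the value predicted by the line.
observed  <- c(10.0, 12.5, 15.2)   # hypothetical observed y values
predicted <- c(10.4, 12.1, 15.9)   # hypothetical predictions from a fitted line
res <- observed - predicted        # -0.4, 0.4, -0.7
```

The second point sits 0.4 above the line, while the third sits 0.7 below it.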
Let’s look at an example
Create a dataset
data1 <- c(48.5, 54.5, 61.25, 69, 74.5, 85, 89, 99, 112, 123, 134, 142)
data2 <- c(8, 9.44, 10.08, 11.81, 12.28, 13.61, 15.13, 15.47, 17.36, 18.07, 20.79, 16.06)
Plot the data to visualize
plot(data1, data2)
Create and view a linear model
fit1 <- lm(data2 ~ data1)
fit1
Printing fit1 echoes the call, lm(formula = data2 ~ data1), along with the fitted coefficients.
Let’s look at the correlation value
The correlation value between the two data sets is showing a strong positive relationship
We will create a regression line to visualise this further
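Both steps can be sketched together, repeating the data from above. The col argument is just a styling choice:

```r
data1 <- c(48.5, 54.5, 61.25, 69, 74.5, 85, 89, 99, 112, 123, 134, 142)
data2 <- c(8, 9.44, 10.08, 11.81, 12.28, 13.61, 15.13, 15.47, 17.36, 18.07, 20.79, 16.06)

cor(data1, data2)          # around 0.93: a strong positive relationship

fit1 <- lm(data2 ~ data1)
plot(data1, data2)
abline(fit1, col = "red")  # overlay the best-fit regression line on the scatter plot
```

abline() accepts a fitted lm object directly and draws the line from its intercept and slope.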
From the graph we can see that most data points fit closely to the line; however, a couple of data points sit further away. Possible outliers?
To investigate this further we will now look at our residuals
First, create residual values for our model fit1
Next, make them easier to read by rounding to 2 significant figures
res <- signif(residuals(fit1), 2)
We can see from the output that the residuals range from -3.70 to 1.90.
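That range can be reproduced with a self-contained sketch that repeats the earlier steps:

```r
data1 <- c(48.5, 54.5, 61.25, 69, 74.5, 85, 89, 99, 112, 123, 134, 142)
data2 <- c(8, 9.44, 10.08, 11.81, 12.28, 13.61, 15.13, 15.47, 17.36, 18.07, 20.79, 16.06)

fit1 <- lm(data2 ~ data1)
res <- signif(residuals(fit1), 2)  # round residuals to 2 significant figures
range(res)                         # -3.7  1.9
```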
To put the residual values into perspective, we will standardize them and visualize the results using several methods.
fit1.stdRes <- rstandard(fit1)
Using a normality plot
qqnorm(fit1.stdRes, ylab = "Standardized Residuals", xlab = "Normal Scores", main = "Normality Plot", col = "red")
qqline(fit1.stdRes)
We can see that most data points are close to the line, however one in particular is quite far.
Using an abline
plot(fit1.stdRes, col = "red")
abline(0, 0)
Using standardized residuals we can really get a good understanding of how far the data points are from the line.
Using a histogram
hist(fit1.stdRes, col = "red")
Our histogram indicates that the residuals are not normally distributed.
In our initial observation of our data set, we suspected a possible outlier.
Using this example we have concluded that there is a definite need to investigate one of the data points, as it does not fit well to our best-fit line.
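As a follow-up, the suspect observation can be located programmatically. This is a minimal sketch assuming the same data as earlier; using which.max here is a convenience choice, not part of the walkthrough above:

```r
data1 <- c(48.5, 54.5, 61.25, 69, 74.5, 85, 89, 99, 112, 123, 134, 142)
data2 <- c(8, 9.44, 10.08, 11.81, 12.28, 13.61, 15.13, 15.47, 17.36, 18.07, 20.79, 16.06)

fit1 <- lm(data2 ~ data1)
fit1.stdRes <- rstandard(fit1)

# Index of the largest absolute standardized residual,
# i.e. the point furthest from the line; here it flags the last (12th) point.
which.max(abs(fit1.stdRes))
```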
Outliers can dramatically affect models; often they are simply removed from a dataset and deemed incorrect values. However, before removing an outlier, bear in mind that it might be a surprising piece of information, a golden nugget, that gives new insight into your analysis.