Looking at relationships and making predictions
Looking at relationships and making predictions is one of the main tasks of data analysis.
Using correlation and simple linear regression we can look at the relationship between two variables.
Using the following dataset we can demonstrate this process.
The dataset consists of two quantitative variables, the number of cricket chirps and the temperature in degrees Fahrenheit. There are 8 observations.
Using R we will first create our two vectors:
- chirp_nums <- c(18, 20, 21, 23, 27, 30, 34, 39)
- temp_f <- c(57, 60, 64, 65, 68, 71, 74, 77)
As we learnt from Anscombe's quartet, let's visualise our data first; let's use a scatterplot:
- plot(temp_f, chirp_nums)
Visualising allows us to see the shape and dispersion of the data.
The scatterplot for our dataset shows the points moving in an uphill pattern from left to right, which suggests a positive correlation.
If the data moves in a downhill pattern from left to right, this suggests a negative correlation.
If the data doesn't seem to show any kind of pattern, then no linear relationship exists between the two variables.
We can use R to calculate the correlation between chirps and temperature:
- cor(temp_f, chirp_nums)
output: [1] 0.9774791
This correlation value for the dataset also indicates that there is a strong positive correlation between these two variables.
Correlation is a statistical technique used to determine to what degree two variables are related.
The symbol r is used to denote Pearson's correlation coefficient; the magnitude of r denotes the strength of the association or relationship.
If the sign is +, the relationship is positive, so an increase in one variable is associated with an increase in the other variable and a decrease in one variable is associated with a decrease in the other variable.
If the sign is -, the relationship is negative, so an increase in one variable is associated with a decrease in the other variable.
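To make the definition concrete, here is a sketch of how r could be computed by hand, using the covariance-over-standard-deviations formula; it should agree with the built-in cor():

```r
# Recreate the data so the block is self-contained
chirp_nums <- c(18, 20, 21, 23, 27, 30, 34, 39)
temp_f <- c(57, 60, 64, 65, 68, 71, 74, 77)

# Pearson's r = cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(temp_f, chirp_nums) / (sd(temp_f) * sd(chirp_nums))
r_builtin <- cor(temp_f, chirp_nums)

print(round(r_manual, 7))            # should match cor()'s 0.9774791
print(all.equal(r_manual, r_builtin))
```

Seeing the formula spelled out makes it clear why the sign of the covariance determines the sign of r.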
It is important to note that a strong correlation implies that there is a relationship between the two quantitative variables; however, it does not imply causation, i.e. one variable does not necessarily cause the change in the other. In this example the number of chirps does not cause the temperature to change.
Let's create a linear model, called fit, then let's view fit:
- fit <- lm(chirp_nums ~ temp_f)
- fit
The output gives us our coefficients, which are used to create our regression line.
Regression places the emphasis on predicting one variable from another, e.g. if the temperature is 70 degrees, how many chirps will there be?
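R can answer this kind of question directly: once a model is fitted, predict() accepts a data frame of new predictor values. A minimal sketch, refitting the model so the block is self-contained:

```r
chirp_nums <- c(18, 20, 21, 23, 27, 30, 34, 39)
temp_f <- c(57, 60, 64, 65, 68, 71, 74, 77)
fit <- lm(chirp_nums ~ temp_f)

# Predicted number of chirps when the temperature is 70 degrees
predict(fit, newdata = data.frame(temp_f = 70))
```

Note that predict() uses the full-precision coefficients, so its answer (about 29.66) can differ slightly from a hand calculation done with rounded coefficients.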
Lets visualise our regression line, first plot our graph, then add our line from fit:
- plot(temp_f, chirp_nums)
- abline(fit)
Look at how close the data points sit to the line; this closeness is another indication of a strong positive relationship.
Remember back to school: the equation of a line is y = mx + c, where m is the slope and c is the y-intercept.
The slope represents the change in y that is associated with a unit change in x.
In the output above, our slope value is 1.055 and our y-intercept is -44.177. If we substitute these values into our equation we can start predicting outcomes: y = 1.055x - 44.177.
E.g. if the temperature is 70, how many chirps? y = 1.055(70) - 44.177 = 29.673, which rounds to 30 chirps.
If in doubt, look at the graph: when the temperature is 70, is the number of chirps about 30?
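As a final sanity check, the slope and intercept can be pulled out of the fitted model with coef() and substituted into y = mx + c by hand; this sketch repeats that substitution in code:

```r
chirp_nums <- c(18, 20, 21, 23, 27, 30, 34, 39)
temp_f <- c(57, 60, 64, 65, 68, 71, 74, 77)
fit <- lm(chirp_nums ~ temp_f)

m <- coef(fit)[["temp_f"]]           # slope, about 1.055
c_int <- coef(fit)[["(Intercept)"]]  # y-intercept, about -44.177

y <- m * 70 + c_int  # predicted chirps at 70 degrees
round(y)             # about 30
```

Working from coef() rather than the printed output avoids copying rounded numbers by hand.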