Multiple Linear Regression and R Step Function

We use regression to build a model that predicts the quantitative value of ‘y’ from the quantitative value of one ‘x’, or of several ‘x’ variables.

Simple linear regression uses exactly one ‘x’ variable to estimate the value of the ‘y’ variable.

Multiple linear regression uses more than one ‘x’ variable (independent) to estimate the ‘y’ variable (dependent).

As most data sets in the ‘real world’ contain more than one independent variable, it is more likely that you will be using multiple linear regression than simple linear regression.
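As a minimal, hypothetical sketch (the data frame df and the columns x1, x2 and y below are made up purely for illustration), this is how a multiple linear regression is fitted in R using the formula interface of lm():

# Hypothetical data: y is a linear combination of x1 and x2 plus noise
set.seed(1)
df <- data.frame( x1 = rnorm(100), x2 = rnorm(100) )
df$y <- 2 + 1.5*df$x1 - 0.8*df$x2 + rnorm(100)

# Fit the multiple linear regression y ~ x1 + x2
fit <- lm( y ~ x1 + x2, data=df )
summary( fit )    # coefficients, p-values, R-squared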

Multiple linear regression has both strengths and weaknesses.

One of its strengths is that it is easy to understand, as it is an extension of simple linear regression; it is therefore by far the most common approach to modelling numeric data. Another strength is that it provides estimates of the size and the strength of the relationships between the features and the outcome.

Its weaknesses are that it makes assumptions about the data, i.e. a normal distribution, and it does not work well with missing data. Another constraint is that it only works with numeric features, so categorical features require extra processing, such as conversion to numeric values.
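As a rough sketch of the kind of extra processing involved (region is a hypothetical categorical column), R can dummy-code a factor automatically inside lm(), or we can build the numeric indicator columns ourselves with model.matrix():

# Hypothetical data frame with a categorical feature
df <- data.frame( y = c(70, 72, 69, 75),
                  region = factor( c("north", "south", "north", "west") ) )

# lm() converts factor columns to dummy (indicator) variables automatically
fit <- lm( y ~ region, data=df )

# Or create the numeric indicator columns explicitly
model.matrix( ~ region, data=df )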

Assumptions of Multiple Linear Regression Analysis

The main assumptions in multiple linear regression analysis are as follows:

  • Numeric data – all data must be in numeric format
  • Normal distribution of the multivariate data
  • No missing values
  • Linear relationship – the relationship between the dependent variable and the independent variables follows a linear pattern, with no extreme outliers
  • Homoscedasticity – the residuals have roughly constant variance
  • No multicollinearity – if two independent variables are highly correlated it can be problematic, and the linear model should only include one of them (see the sketch after this list)
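A minimal sketch of how some of these checks might look in R, assuming a data frame dat whose columns are all numeric and whose dependent variable is y (both names are illustrative only):

# Missing values: count the NAs in each column
colSums( is.na(dat) )

# Multicollinearity: inspect the pairwise correlations
round( cor( dat, use="complete.obs" ), 2 )

# Normality of a single variable, e.g. the dependent variable
qqnorm( dat$y ); qqline( dat$y )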

Evaluating a Multiple Linear Regression Model

There are three main criteria for evaluating a multiple linear regression model:

  • The goodness of fit – to check the goodness of fit we are interested in the residuals and how far they fall from the fitted line. The residual is the true value minus the predicted value (see the sketch after this list).
  • The significance level – in R the significance of a variable is easy to recognise, as it is denoted by ***, **, *, etc. The presence of three stars (***) indicates that it is highly unlikely that the variable is unrelated to the dependent variable. Common practice is to use a significance level of 0.05 to denote a statistically significant variable; in the absence of R highlighting these for us, we check that the p-value is below 0.05.
  • The coefficient of determination – this is the multiple R-squared value; the closer it is to 1, the better the model fits the data. In models with large numbers of independent variables we use the adjusted R-squared value.
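As a brief sketch of how these quantities can be pulled out of a fitted model (fit is a hypothetical lm object):

residuals( fit )                 # true values minus predicted values
fitted( fit )                    # the predicted values
summary( fit )$coefficients      # estimates, standard errors and p-values
summary( fit )$r.squared         # coefficient of determination
summary( fit )$adj.r.squared     # adjusted R-squared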

A Multiple Linear Regression Example

Building a model to predict life expectancy

First we read in the data file and visualise the data:

stateX77 <- read.csv( 'state_data_77.csv', header=TRUE )

str(stateX77)


head(stateX77)


pairs(stateX77)


In the plots above we can see some relationships, both positive and negative, between variables.

We can look at a correlation table to see the actual figures.

cor(stateX77[,2:9])

There are 9 columns and the first column is not numeric, so we will look at columns 2 to 9.

To make the table appear neater we can round the numbers:

round( cor(stateX77[,2:9]), 2)


Some observations:

  • Murder and Illit: fairly high positive correlation (0.70)
  • Murder and LifeExp: high negative correlation (-0.78)
  • HSGrad and Income: fairly high positive correlation (0.62)
  • HSGrad and Illit: fairly high negative correlation (-0.66)
  • Frost and Illit: fairly high negative correlation (-0.67)

Now we build the model and inspect it.

Life expectancy is our dependent variable.

fit1 <- lm( LifeExp ~ Popul + Income + Illit + Murder + HSGrad + Frost + Area, data=stateX77)

summary( fit1 )


R uses stars to highlight the significant variables. We can see from the model that some variables (Income, Illit and Area) are not significant; these have very high p-values. We will create a new model without these variables.

Note the model has a decent R-squared value.

fit2 <- lm( LifeExp ~ Popul + Murder + HSGrad + Frost, data=stateX77)

summary(fit2)


This model shows us that Murder, HSGrad and Frost are significant predictors of life expectancy, and Popul is marginally significant.

Note the model has a decent R-squared value.
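As a quick, optional check (not part of the original output), we can compare the two models directly by pulling out their adjusted R-squared values and running a nested-model F-test:

summary( fit1 )$adj.r.squared    # full model
summary( fit2 )$adj.r.squared    # reduced model
anova( fit2, fit1 )              # F-test: do the dropped variables add anything?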

Using the R step function to find the best-fit model

R has a step() function that can be used to search for a best-fit model. It steps through candidate models one at a time, dropping variables (based on the AIC criterion) until no further change improves the fit.

We will use the step function to validate our findings. R will start with our original model with all variables.

rFit <- step( fit1 )


R has indicated Popul, Murder, HSGrad and Frost as the most relevant variables, which concurs with our second model.
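We can inspect the model that step() settled on in the usual way (a small follow-up, not shown in the original output):

summary( rFit )     # coefficients and significance of the selected model
formula( rFit )     # the formula step() finished with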

Checking the residuals

In our models we have evaluated the significance values and the R-squared value, and both are looking good.

Now we must check the residuals for our best model, fit2.

qqnorm( fit2$residuals, ylab="Residuals" )

qqline( fit2$residuals, col="red" )


The QQ plot shows that the points do not fit well along the line, which raises a question over whether the residuals are normally distributed.
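As an optional follow-up, a Shapiro-Wilk test gives a formal check of the residuals' normality; a small p-value suggests a departure from normality:

shapiro.test( fit2$residuals )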

Conclusion

Despite the relatively high adjusted R-squared value of 0.71, the residual plot suggests there is some bias in our model.

Our conclusion is that the residuals show some departure from a normal distribution and we need to revise the model further.