Multiple Linear Regression and R Step Function

We use regression to build a model that predicts the quantitative value of ‘y’ using the quantitative value of one or more ‘x’ variables.

Simple linear regression uses exactly one ‘x’ variable to estimate the value of the ‘y’ variable.

Multiple linear regression uses more than one ‘x’ variable (independent) to estimate the ‘y’ variable (dependent).
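In equation form, the fitted model is a weighted sum of the predictors:

y = b0 + b1*x1 + b2*x2 + … + bk*xk + e

where b0 is the intercept, each bi is the estimated coefficient for predictor xi, and e is the error term. Simple linear regression is the special case with a single x.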

As most data sets in the ‘real world’ contain more than one independent variable, it is more likely that you will be using multiple linear regression than simple linear regression.

Multiple linear regression has both strengths and weaknesses.

One of its strengths is that it is easy to understand, as it is an extension of simple linear regression; it is therefore by far the most common approach to modelling numeric data. Another strength is that it provides estimates of the size and strength of the relationships between the features and the outcome.

Its weaknesses are that it makes assumptions about the data, e.g. normal distribution, and that it does not work well with missing data. Another constraint is that it only works with numeric features, so categorical features require extra processing, such as conversion to numeric values.
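As a sketch of that conversion step, R's model.matrix() can expand a categorical (factor) feature into numeric dummy columns. The data frame and column names below are invented for the example.

```r
# Hypothetical data frame with one categorical feature (invented names)
toy <- data.frame(price = c(100, 150, 120),
                  area  = factor(c("urban", "rural", "urban")))

# model.matrix() expands the factor into numeric dummy columns;
# [, -1, drop = FALSE] drops the intercept column
dummies <- model.matrix(~ area, data = toy)[, -1, drop = FALSE]
toy_numeric <- cbind(toy["price"], dummies)
toy_numeric   # price plus a 0/1 areaurban column
```

Note that lm() performs this expansion automatically for factor columns; the explicit step matters for purely numeric methods such as k-means or KNN.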

Assumptions of Multi Linear Regression Analysis

The main assumptions in multi linear regression analysis are as follows:

  • Numeric data – all data must be in numeric format
  • Normal distribution of the multivariate data
  • No missing values
  • Linear relationship and homoscedasticity – the relationship between the dependent variable and the independent variables follows a linear pattern, and the residuals have a constant spread; outliers can violate both
  • No multicollinearity – if two variables are highly correlated it can be problematic, and the linear model should include only one of them
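These checks can be sketched in R. The data frame below is simulated stand-in data, and shapiro.test is one common (though not the only) normality check.

```r
# Simulated stand-in data for illustration
set.seed(1)
chk <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))

anyNA(chk)                    # FALSE: no missing values
all(sapply(chk, is.numeric))  # TRUE: every column is numeric
cor(chk$x1, chk$x2)           # a value near +/-1 would flag multicollinearity
shapiro.test(chk$y)           # a large p-value is consistent with normality
```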

Evaluating a Multi Linear Regression Model

There are three main criteria concerned with evaluating a multi linear regression model:

  • The goodness of fit of the line – to check the goodness of fit we are interested in the residuals and how far away they are from the line. A residual is the true value minus the predicted value.
  • The significance level – in R the significance of a variable is easy to recognise, as it is denoted by ***, **, *, etc. The presence of three stars (***) indicates that it is highly unlikely that the variable is unrelated to the dependent variable. Common practice is to use a significance level of 0.05 to denote a statistically significant variable. Where R does not highlight these for us, we should check the p-value ourselves: we want p < 0.05.
  • The coefficient of determination – this is our multiple R-squared value; the closer the R-squared value is to 1, the better the model fits the data. In models with large numbers of independent variables we use the adjusted R-squared value.
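All three criteria can be pulled out of an lm fit in R. The toy data below is invented purely for illustration.

```r
# Invented toy data for the sketch
set.seed(42)
x <- 1:20
y <- 2 * x + rnorm(20)
fit <- lm(y ~ x)

res  <- y - predict(fit)  # residuals: the true value minus the predicted value
pval <- summary(fit)$coefficients["x", "Pr(>|t|)"]  # significance (p-value) of x
r2   <- summary(fit)$r.squared  # coefficient of determination
```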

A Multi Linear Regression Example

Predicting a model for life expectancy

First we will read in the data file and visualise the data

stateX77 <- read.csv( "state_data_77.csv", header=T )







In the plots above we can see some relationships, both positive and negative, between variables.

We can look at a correlation table to look at the actual figures.


There are 9 variables and the first column is not numeric, so we will look at columns 2 to 9.

To make the table appear neater we can round the numbers:

round( cor(stateX77[,2:9]), 2)


Some Observations:

  • Murder and Illit – fairly high positive correlation: 0.70
  • Murder and LifeExp – high negative correlation: -0.78
  • HSGrad and Income – fairly high positive correlation: 0.62
  • HSGrad and Illit – fairly high negative correlation: -0.66
  • Frost and Illit – fairly high negative correlation: -0.67

Now, build the model and inspect it

Life expectancy is our dependent variable.

fit1 <- lm( LifeExp ~ Popul + Income + Illit + Murder + HSGrad + Frost + Area, data=stateX77)

summary( fit1 )


R is using stars to highlight the significant variables. We can see from the model that some variables are insignificant – Income, Illit and Area have very high p-values – so we will create a new model without these variables.

Note the model has a decent R-squared value.

fit2 <- lm( LifeExp ~ Popul + Murder + HSGrad + Frost, data=stateX77)



This model shows that Murder, HSGrad and Frost are significant for Life Expectancy, and that Popul is marginally significant.

Note the model has a decent R-squared value.

Using R Step to find best fit model

R has a step function that can be used to find a best fit model. The step function works through candidate models one at a time, dropping the least useful variable (judged by AIC) at each stage until no further removal improves the model.
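To see what one elimination pass looks at, drop1() reports the AIC of the model with each variable removed in turn; this is what step() evaluates repeatedly. The sketch below uses simulated data with invented variable names rather than the state data set.

```r
# Simulated data: only x1 genuinely drives y
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
d$y <- 3 * d$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2 + x3, data = d)
drop1(fit)                   # AIC of the model with each variable removed in turn
best <- step(fit, trace = 0) # backward elimination by AIC
formula(best)                # x1 is retained; the noise variables usually go
```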

We will use the step function to validate our findings. R will start with our original model with all variables.

rFit <- step( fit1 )


R has indicated Popul, Murder, HSGrad and Frost as the most relevant variables, which concurs with our model.

Checking the residuals

In our models we have evaluated the significance values and the R-squared value; both look good for our model.

Now we must check the residuals for our best model, fit2.

qqnorm( fit2$residuals, ylab="Residuals" )

qqline( fit2$residuals, col="red" )


The QQ Plot is showing that the data points are not fitting well along the line. This would indicate a question mark over whether the data is normally distributed.


Despite the fact that we have a relatively high adjusted R-squared value of 0.71, the residual plot suggests there is some bias in our model.

Our conclusions are that the residuals show some departure from a normal distribution and we need to revise the model further.

K-Means and KNN: An Investigation


In computer science, machine learning evolved from the study of pattern recognition and computational learning theory in artificial intelligence.

The aim of machine learning is to construct models, using algorithms that can learn from a dataset, to enable predictions to be made on unseen data.

K-Means aims to cluster data into various types or groups. It is an unsupervised learning method: its purpose is to discover the structure of the inputs and determine hidden patterns in the data.

K-Nearest Neighbours is a supervised learning method, as it uses a given type, often called a label, from the dataset. KNN learns from the labelled data and uses the labels to make predictions on unseen data.
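The contrast can be made concrete with R's built-in iris data (not the Heathrow set used below): kmeans() never sees the Species label, while class::knn() relies on it. A minimal sketch, assuming the class package (shipped with standard R installations) is available; the variable names are my own.

```r
library(class)  # provides knn()

X <- scale(iris[, 1:4])  # numeric features only, standardised

# Unsupervised: discover 3 groups without looking at Species
set.seed(1)
km <- kmeans(X, centers = 3, nstart = 25)

# Supervised: predict Species for held-out rows using labelled neighbours
train_idx <- seq(1, nrow(X), by = 2)
pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
            cl = iris$Species[train_idx], k = 3)
mean(pred == iris$Species[-train_idx])  # accuracy on the held-out half
```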


The task is to find a dataset suitable for K-means cluster analysis and K-Nearest Neighbour predictions.

The data set I have chosen is weather recordings at Heathrow (London Airport).

The question I would like to answer is:

“Are there really four seasons?”

NB: This is a UK weather dataset

Data Source:

Data Set Information:

It is a real data set.

The data set comes from

Heathrow (London Airport) Location 507800E 176700N, Lat 51.479 Lon -0.449, 25m amsl

Data includes:

By month:

  • Mean maximum temperature (tmax)
  • Mean minimum temperature (tmin)
  • Days of air frost (af)
  • Total rainfall (rain)
  • Total sunshine duration (sun)


Visualise, explore and clean data

Data Issues to be addressed:

  • Missing data (more than 2 days missing in month) is marked by —.
  • Data runs from Jan 1948 – Jan 2016
  • Estimated Data is marked with *
  • Sunshine data taken from an automatic Kipp & Zonen sensor is marked with #, otherwise the data is taken from a Campbell Stokes recorder.

The original dataset contains 7 variables and 818 observations.

Final dataset contains 7 variables and 708 observations (59 years of data)

Tools used: R and Excel

Variable summary – missing values, comments before and after cleaning:

  • yyyy – no missing values. Before: years 1948-1956 have “—” for af days and sun hours (delete); year 2016 has only 1 month, taken out to keep complete years. After: years 1948-1956 deleted.
  • mm – no missing values. Months listed in numerical order: no issues.
  • tmax degC – no missing values, no issues.
  • tmin degC – no missing values, no issues.
  • af days – no missing values. Before: data for year 1948 listed as “—” (delete); 48 (0.05%) values have strange data, spaces between numbers, which looks like a data entry error. After: “—” rows deleted in Excel; as this was a small percentage, inputs were corrected using best judgement.
  • rain mm – no missing values, no issues.
  • sun hours – no missing values. Before: data for years 1948-1956 listed as “—” (delete); some values have # and need to be cleaned. After: years 1948-1956 deleted; # removed.
  • type – no missing values. New column added to enable clusters to be tested.

K-Means Clustering – Overview of Process

Reference Tutorial


1. Read in data

2. Install libraries

3. Visualise the data


4. As the range of the values varies, standardise the values (taking out columns 1 and 2)


5. Create the function to create the clusters and run

6. Output suggests 2 clusters



7. In our dataset the data is distinguished by month (1 for January, 2 for February, etc.), so there are 12 types, which is not suitable for 2 clusters. I had suggested seasons, which would be 4 clusters, but this is not suitable either, so:

8. I used a common-sense approach to assigning a type value of 1 or 2 to the dataset, as the data lends itself nicely to this: I assigned type 1 to Nov, Dec, Jan, Feb and Mar, and type 2 to Apr, May, Jun, Jul, Aug, Sep and Oct.

However, if you have no idea about the data or how to distinguish the types, you can use the cluster aggregate output to create a type attribute in the dataset and assign a cluster value:


9. Test the fit of your clusters to see how accurate they are.

The output table below shows that cluster 1, type 1 has 290 correct and 5 wrong, and cluster 2, type 2 has 391 correct and 22 wrong. This seems like a pretty good model.
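The accuracy implied by that table can be computed directly from the quoted counts:

```r
# Confusion table quoted above: rows = true type, cols = assigned cluster
confusion <- matrix(c(290, 22, 5, 391), nrow = 2,
                    dimnames = list(type = c("1", "2"), cluster = c("1", "2")))

# overall accuracy: correctly clustered points / all points
accuracy <- sum(diag(confusion)) / sum(confusion)
round(accuracy, 3)  # about 0.96
```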


10. Re-confirm your findings by running a Rand index function (‘randindex’), which gives a measure of agreement. Here it suggests an 85% fit, which is good.
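The ‘randindex’ step most likely refers to a Rand index function such as randIndex from the flexclust package (an assumption on my part). The plain Rand index can also be computed by hand, which makes the ‘% fit’ idea concrete; the cluster vectors below are invented.

```r
# Plain Rand index: the fraction of point pairs on which two
# clusterings agree (both put the pair together, or both keep it apart)
rand_index <- function(a, b) {
  n <- length(a)
  agree <- 0
  pairs <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      same_a <- a[i] == a[j]
      same_b <- b[i] == b[j]
      if (same_a == same_b) agree <- agree + 1
      pairs <- pairs + 1
    }
  }
  agree / pairs
}

truth    <- c(1, 1, 1, 2, 2, 2)
clusters <- c(1, 1, 2, 2, 2, 2)
rand_index(truth, truth)     # 1: perfect agreement
rand_index(truth, clusters)  # partial agreement
```

The adjusted Rand index (ARI) additionally corrects for chance agreement; the ~0.85 ARI values quoted in the code appendix are that corrected version, so they are not directly comparable to the plain index above.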


K-Means Clustering – Summary

To summarise the findings of this data set according to K-means: we have 2 clusters, so in answer to my question:

“Are there really four seasons?”

No, we have 2 seasons:

  • Season 1 – Nov, Dec, Jan, Feb and Mar
  • Season 2 – Apr, May, Jun, Jul, Aug, Sep and Oct


K-Nearest Neighbour – Overview of Process

Reference tutorial 


1. Read in data

2. Install libraries

3. Visualise the data


4. Normalise the data: (type 1 is now type 0 and type 2 is now type 1)


5. Create training and test data sets with a 2/3, 1/3 split

6. Check output of test data set for effectiveness


7. This model shows only 6 errors for each type, with 94% and 96% accuracy respectively. This suggests that this is a good model.

K-Nearest Neighbour – Summary

To summarise the findings of this data set according to K-Nearest Neighbours: we have tested the 2 types of cluster using training and test data sets. The model is a good fit, so we can conclude:

  • Season 1 – Nov, Dec, Jan, Feb and Mar,
  • Season 2 – Apr, May, Jun, Jul, Aug, Sep and Oct


I posed the question “Are there really four seasons?” and used this dataset, which has 59 years of monthly averages for various variables.

The tests that I ran suggest that in fact we have 2 seasons:

Season 1 – Nov, Dec, Jan, Feb and Mar

Season 2 – Apr, May, Jun, Jul, Aug, Sep and Oct

In reality, yes, it does feel like our seasons are no longer 4 distinct seasons, as one season feels like it rolls into the next, with the winter months being more distinguishable from the rest.

However, I tested these models using all available variables. Given more time, I would test them using different combinations of variables, e.g. just temperature, to see how they affect the resulting clusters.

R Code and comments:

K-Means Clustering

#data <- read.csv("heathrowdata1Type.csv", header = TRUE)
data <- read.csv("heathrowdata1Type_reverse.csv", header = TRUE)

# Get a feel for the data

wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)  # use the seed argument, so each run is reproducible
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}

# Standardise the units – the data values have varying ranges
# also taking out columns 1 and 2 (mm and type)
df <- scale(data[-1:-2])

# determine the number of clusters
# every kink in the graph shows a potential cluster of data, showing up to 15 clusters
# the best number of clusters is at the elbow of the graph
# identifying 2 as the best number of clusters

# NbClust – has the clustering algorithms
# setting the random number generator

# telling it we want min 2 clusters, max 15, and to use the kmeans algorithm to determine which to use
nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")
# two graphs show – the output: “According to the majority rule, the best number of clusters is 2”


# 2 clusters
# Visualise as a bar chart
# (the barplot() call itself was lost from these notes; reconstructed)
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria")

# set.seed allows you to set the seed for the random number generator
set.seed(1234)

# nstart=25 runs kmeans from 25 random starts and keeps the best solution
# (the object name fit.km is reconstructed – the original name was lost from these notes)
fit.km <- kmeans(df, 2, nstart=25)

fit.km$size    # outputs the 2 cluster sizes, 312 and 396, suggesting one cluster takes in more months!
fit.km$centers # prints out the centre point of each cluster; looking at the centre points
# we can see a definite difference between the clusters.

# show the mean of each variable in each cluster
aggregate(data[-1:-2], by=list(cluster=fit.km$cluster), mean)

# show table of correct and incorrect counts
# (assigned to ct; the original variable name was lost)
ct <- table(data$type, fit.km$cluster)

# OUTPUTS 1 and 2 (heathrowdata1Type.csv) LOOK THE WRONG WAY AROUND – change the types in the dataset to the opposite way around
# OUTPUT 1: – when type 2 – jan, feb, nov, dec, type 1 the rest
#   1   2
#1  76 394
#2 236   2

# OUTPUT 2:  – when type 2 – jan, feb, mar, nov, dec, type 1 the rest.
#   1   2
#1  22 389
#2 290   7

# AFTER TYPE CHANGED AROUND (heathrowdata1Type_reverse.csv) – table reads correct now.
# OUTPUT 3: – when type 1 – jan, feb, mar, nov, dec and type 2 the rest
#   1   2
#1 290   5
#2  22 391

# OUTPUT 1: ARI – 0.6067875 – quite good fit??
# OUTPUT 2: ARI – 0.842598 – good fit.
# OUTPUT 3: ARI – 0.8530182 – good fit

KNN – K-Nearest Neighbours

data <- read.csv("heathrowdata1_knn.csv", header = TRUE)


# This is the normalisation function
normalize <- function(x) {
  num <- x - min(x)
  denom <- max(x) - min(x)
  return (num/denom)
}

# create dataframe of normalised data
data_norm <- data[1:6], normalize))

#Training And Test Sets

#To make your training and test sets, you first set a seed.
#This is a number for R’s random number generator.
#The major advantage of setting a seed is that you can get the same sequence of random numbers
#whenever you supply the same seed in the random number generator.

#Then, you want to make sure that your data set is shuffled and that you have the same ratio between types in your training and test sets. You use the sample() function to take a sample with a size that is set as the number of rows of the data set.
#You sample with replacement: you choose from a vector of 2 elements and assign either 1 or 2 to the rows of the data set.
#The assignment of the elements is subject to probability weights of 0.67 and 0.33.
ind <- sample(2, nrow(data_norm), replace=TRUE, prob=c(0.67, 0.33))

#ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.67, 0.33))
# just seeing what happens if I don’t normalise data
# I want training set to be 2/3 of the original data set:
#assign 1 with a probability of 0.67 and the other with a probability of 0.33 to the sample rows.

 <- data_norm[ind==1, 1:5]
data.test <- data_norm[ind==2, 1:5]

# <- data[ind==1, 1:5]
#data.test <- data[ind==2, 1:5]
# just seeing what happens if I don’t normalise data

#Note that, in addition to the 2/3 and 1/3 proportions specified above, you don’t take into account all attributes to form the training and test sets.
#This is because you actually want to predict the sixth attribute, type: it is your target variable.
#However, you do want to include it in the KNN algorithm, otherwise there will never be any prediction for it.
#You therefore need to store the class labels in factor vectors and divide them over the training and test sets.
# store the class labels (column 6) for the training and test sets
data.trainLabels <- data_norm[ind==1, 6]
data.testLabels <- data_norm[ind==2, 6]

#data.trainLabels <- data[ind==1, 6]
#data.testLabels <- data[ind==2, 6]
# just seeing what happens if I don’t normalise data
#The Actual KNN Model
#Building Your Classifier
#After all these preparation steps, you have made sure that all your known (training) data is stored.
#No actual model or learning was performed up until this moment. Now, you want to find the k nearest neighbors of your training set.
#An easy way to do these two steps is by using the knn() function,
#which uses the Euclidean distance measure in order to find the k-nearest neighbours to your new, unknown instance. Here, the k parameter is one that you set yourself.
#As mentioned before, new instances are classified by looking at the majority vote or weighted vote.
#In case of classification, the data point with the highest score wins the battle and the unknown
#instance receives the label of that winning data point.
#If there is a tie between winners, the classification happens randomly.

#Note the k parameter is often an odd number to avoid ties in the voting scores.
#To build your classifier, you need to take the knn() function and simply add some arguments to it,
#just like in this example:

data_pred <- knn(train =, test = data.test, cl = data.trainLabels, k=3)
#The result of this command is the factor vector with the predicted classes for each row of the test data.

#Evaluation of Your Model

#OUTPUT – 12 incorrect
#You see that the model makes reasonably accurate predictions,

#Then you can make a cross tabulation or a contingency table.
#This type of table is often used to understand the relationship between two variables.
#In this case, you want to understand how the classes of your test data, stored in data.testLabels relate to your model that is stored in data_pred:
CrossTable(x = data.testLabels, y = data_pred, prop.chisq=FALSE)

#Note that the last argument prop.chisq indicates whether or not the chi-square contribution of each cell is included.
#The chi-square statistic is the sum of the contributions from each of the individual cells and is used to decide whether the difference between the observed and the expected values is significant.

#From this table, you can derive the number of correct and incorrect predictions: 6 instances of 0 predicted as 1, and 6 instances of 1 predicted as 0.
#In all other cases, correct predictions were made. You can conclude that the model’s performance is good enough and that you don’t need to improve the model!

# NB: the data that was not normalised didn’t perform too badly; however, the normalised data performed better.







airbnb – Analysis of Variance


Airbnb Business Problem:

Where will a new guest book their first travel experience?

Instead of waking to overlooked “Do not disturb” signs, Airbnb travellers find themselves rising with the birds in a whimsical treehouse, having their morning coffee on the deck of a houseboat, or cooking a shared regional breakfast with their hosts.

New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand.

In this recruiting competition, Airbnb challenges you to predict in which country a new user will make his or her first booking. Kagglers who impress with their answer (and an explanation of how they got there) will be considered for an interview for the opportunity to join Airbnb’s Data Science and Analytics Team.


Airbnb posed the above mentioned business problem.

Through the use of ANOVA we aim to see if the country of destination is affected by the other variables in the data set.

What is our Hypothesis?

H0 = the other variables have no effect on country of destination

H1 = the other variables have an effect on country of destination

Data Source:

Data Set Information:

The data set comes from

It is a true data set that Airbnb placed on Kaggle, as part of a recruitment competition.

Airbnb posed the question “Where will a new guest book their first travel experience?”

The dataset contains 16 Variables and 213451 observations.


Visualise, explore and clean data

Bring in the data, run a summary and graph the data; observations from the output are listed below.

Original dataset contains 16 Variables and 213451 observations.

Final dataset contains 17 variables and 66707 observations

Tools used: R and Excel

Variable summary – missing values, comments before and after cleaning:

  • id – no missing values. Customer ID, not really needed.
  • date_account_created – no missing values. Same date as timestamp_first_active.
  • timestamp_first_active – no missing values. Same date as account created; could split into date and time columns, or remove as not needed. This column was not used, but time could be interesting.
  • date_first_booking – 124530 missing. Remove all blanks, as no booking was made. (Question: is this the date the booking was made, or the date of travel?) After cleaning: all blanks gone.
  • gender – no missing values. Male/female/other/unknown; see if the gender split might be relevant, and what to do with unknown (the biggest category)? After cleaning the above, unknown reduced, female/male similar. Unknown kept – too big to lose.
  • age – 87990 missing. Values range from age 1 to 2014; clean to 18-80? After cleaning: slightly skewed.
  • signup_method – no missing values. Google/Facebook/Basic; Basic biggest, before and after.
  • signup_flow – no missing values.
  • language – no missing values. en biggest, before and after.
  • affiliate_channel – no missing values. Direct biggest, before and after.
  • affiliate_provider – no missing values. Direct biggest, before and after.
  • first_affiliate_tracked – 6065 missing. Might all go when date_first_booking is cleaned. After cleaning: still 352 missing.
  • signup_app – no missing values. Web biggest, before and after.
  • first_device_type – no missing values. Mac desktop and Windows desktop biggest; more Mac than Windows after cleaning.
  • first_browser – no missing values. Mixed outcome, before and after.
  • country_destination – no missing values. Categorical; NDF biggest (will go when date_first_booking is cleaned), US next biggest. After cleaning: NDF gone; US still biggest at 71%, next closest ‘other’ at 11%.
  • country_code – new column created for the ANOVA.

Quick observations:

US is by far the most popular destination, with en as the most popular language.

Looks like most business is coming Direct to Airbnb, through the web.

A substantial proportion of users are on Apple products.



A reminder of our hypothesis:

H0 = the other variables have no effect on country of destination

H1 = the other variables have an effect on country of destination

Outcome of ANOVA:



By looking at the stars that R has given each variable, we can determine what significance, if any, the variable has on the country of destination. The more stars that appear, the more significant the variable is.

Based on the definite significance, denoted by ***, ** and *, that R is showing us, we reject the null hypothesis and accept the alternative: different variables do have an effect on country of destination.

The outcome of the ANOVA test tells us that language and signup_method have no effect on the country of destination. first_browser has some significance, first_affiliate_tracked and signup_app have more, but the most significant are age, gender, signup_flow, affiliate_channel, affiliate_provider and first_device_type.

If we look deeper into this, we can expect the signup and affiliate variables to affect the country of destination, as both types of variable guide the consumer in some way to the Airbnb website; e.g. an affiliate may be guiding a consumer to a particular destination.

Parametric or Non-Parametric?

I think it is fair to say that the data within this dataset is not normally distributed. Some variables have a fairly normal distribution, e.g. age, which we cleaned to give a better distribution, but other variables definitely do not, e.g. US as a destination and ‘en’ as language. Any further analysis should therefore use non-parametric tests.
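As one non-parametric alternative to the one-way ANOVA used above, the Kruskal-Wallis rank-sum test makes no normality assumption. The column names below mirror the Airbnb variables, but the data is simulated for the sketch.

```r
# Simulated stand-in data; column names mirror the Airbnb variables
set.seed(7)
sim <- data.frame(
  age = c(rnorm(50, 30, 5), rnorm(50, 40, 5)),
  country_destination = rep(c("US", "FR"), each = 50)
)

# Rank-based test: does age differ by destination, with no normality assumption?
kruskal.test(age ~ country_destination, data = sim)
```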

There is a definite skew to travel in US, with 71% of bookings for US.

Language is definitely skewed to ‘en’, totalling 97% of bookings.

Just out of curiosity I ran an ANOVA on the dataset excluding US as a destination.

The result was quite interesting:


Here is a comparison:

Variable Name             Dataset with US   Dataset without US
age                       ***               ***
language                                    ***
gender                    ***               ***
signup_flow               ***               ***
affiliate_channel         ***               **
affiliate_provider        ***               **
first_affiliate_tracked   **
signup_app                **                ***
first_device_type         ***               ***
first_browser             *                 *

Note how language has now become a very significant factor, once US as a destination has been removed. The affiliate variables without the US have reduced in significance also.

To summarise, there are a number of issues that I think Airbnb need to address going forward:

Data quality – there are definitely some issues with the way in which Airbnb capture their data. These areas include:

    • Gender – categories need to be defined; not sure if ‘unknown’ is computer generated?
    • Age – lower and upper limits should be put on these
    • First_affiliate_tracked – allows null values
    • Date_first_booking – it suggests the date the booking was made; however, could it be the date of travel?

No bookings – 124543 (58%) accounts were created without generating bookings. Why?

  • There is an opportunity here to incentivise a booking, or an opportunity to capture further data: a quick click box asking why the customer is leaving the site (e.g. just browsing, booking elsewhere, will book at a later date) would give some insight.

Country of destination – it would appear that there is a difference in approach between travel to the US and the rest of the destinations (as per the ANOVA). This needs to be addressed or monitored to see if it is a cultural issue, i.e. are most bookings US citizens travelling within the US? Or maybe the US is an established market and Airbnb is only growing its business elsewhere? Understanding this will determine whether or not the US should be modelled separately from the rest.

Language – ‘en’ is definitely an influencing factor, with 97% of bookings having ‘en’ as language. The next most popular languages are ‘zh’, ‘fr’ and ‘es’, which make up 54% of the remaining 3% of bookings.

This could be investigated to see are there other factors influencing the bookings made with these languages.

Gender – an interesting observation: the gender split does show some differing trends, e.g. women booking more than men for US, FR, GB and IT.

This data is useful and can be used when designing the website.

I would recommend that Airbnb capture the following data:

  • country of residence
  • country of departure
  • date of travel
  • duration of stay
  • whether the trip is one destination or multiple destinations

This data would give more insight into consumer behaviour, would be a good indicator of country of destination, and would help answer questions like: are consumers using Airbnb to travel within their own countries? How far ahead are consumers booking? Do consumers use Airbnb for overnight stays or for holidays? Are consumers using Airbnb on long trips travelling the world?


To conclude, we haven’t quite answered the question “Where will a new guest book their first travel experience?” but there is a pretty good chance it will be US.

However, we have gained some insight to the dataset and were able to make recommendations going forward to help answer this question.


R code used, with comments:

# Read in data file
airbnbData <- read.csv("train_users_2.csv", header = TRUE)

# Get a feel for the data

# Look for NA’s and outliers, anything odd
# OUTPUT – question marks???
# date_first_booking: 124543 blank fields, so no booking made – take out
# gender: 4 categories – unknown, female, male and other; not sure if I want to take out unknown?
# age: NA’s: 87990; outliers: min age 1, max age 2014 – adjust to a reasonable 18-80?
# first_affiliate_tracked: 6065 blank fields. Not sure this variable is relevant?

# Visualise Data
plot(airbnbData$country_destination) # ‘NDF’ most – as above, these represent where no booking exists
plot(airbnbData$gender) # ‘unknown’ by far greatest
plot(airbnbData$age) # visual all over place as expected
hist(airbnbData$age) # data needs cleaning
plot(airbnbData$signup_method) # ‘basic’ most popular by far
plot(airbnbData$language) # ‘en’ most popular by far
plot(airbnbData$affiliate_channel) # ‘direct’ most popular by far
plot(airbnbData$affiliate_provider) # ‘direct’ most popular by far, followed by Google
plot(airbnbData$first_affiliate_tracked) # check this after blanks taken out
plot(airbnbData$signup_app) # ‘web’ most popular by far
plot(airbnbData$first_device_type) # ‘Mac Desktop’ and ‘Windows Desktop’ largest
plot(airbnbData$first_browser) # mixed
# create new dataset – clean to have: only bookings and deal with age
airbnbNew <- airbnbData

# could use these to take out unneeded variables if file too big!
# take out id
#airbnbNew[1] <- NULL

# take out timestamp
#airbnbNew[2] <- NULL

# remove all rows with no date_first_booking – should result in 88908 obs (ref: Excel Clean Data)
airbnbNew <- subset(airbnbData, date_first_booking != "" )

# making sure all NDF are gone

# give age an appropriate range – should result in 67059 obs (ref: Excel Clean Data)
airbnbNew <- subset(airbnbNew, age >= 18 & age <= 80)

# OUTPUT – compare to first summary
# date_first_booking: blank fields gone now
# gender: 4 categories – unknown, female, male and other.
# age: NA’s: all gone, min 18, max 80, mean 36, as expected.
# first_affiliate_tracked: only 352 blank fields now.(originally 6065 blank fields). Take out

# Take out blank first_affiliate_tracked
airbnbNew <- subset(airbnbNew, first_affiliate_tracked != "" )

plot(airbnbNew$gender) # unknown much smaller, female most, close behind male
hist(airbnbNew$age) #this data is slightly skewed. (Excel : age 18 to 60 is less skewed, more normal)
plot(airbnbNew$signup_method) # still ‘basic’ most popular by far
plot(airbnbNew$language) # still ‘en’ most popular by far
plot(airbnbNew$affiliate_channel) # still ‘direct’ most popular by far
plot(airbnbNew$affiliate_provider) # still ‘direct’ most popular by far, followed by Google
plot(airbnbNew$signup_app) # still ‘web’ most popular by far
plot(airbnbNew$first_device_type) #  still Mac Desktop’ and ‘Windows Desktop’ largest, Bigger gap now.
plot(airbnbNew$first_browser) # still mixed

# Create new dataset and Create a new column with country_code – for ANOVA
airbnbClean <- airbnbNew


# Create ANOVA model
model_1 <- aov(country_code~age+language+gender+signup_method+signup_flow+affiliate_channel+affiliate_provider+first_affiliate_tracked+signup_app+first_device_type+first_browser,data = airbnbClean)
# language, signup_method – no significance, take out

model_2 <- aov(country_code~age+gender+signup_flow+affiliate_channel+affiliate_provider+first_affiliate_tracked+signup_app+first_device_type+first_browser,data = airbnbClean)

# Curious – feel that the US is having an effect on these models
# Take out US to see the difference
no_us_data <- airbnbClean
no_us_data <- subset(no_us_data, country_destination != "US" )

model_no_us <- aov(country_code~age+language+gender+signup_method+signup_flow+affiliate_channel+affiliate_provider+first_affiliate_tracked+signup_app+first_device_type+first_browser,data = no_us_data)
# OUTPUT – change in significance levels.
















Checking Regression Assumptions – Residuals

In simple linear regression we look at the relationship between two variables, and we use this relationship to make predictions.

A regression model built from a sample uses the sample results to produce a best fit line for the dataset.

However, if we were to take a new dataset from the same population, the best fit line may not be true.

How well does our model fit?

We create a regression line in order to build a simple linear regression model.

However, two conditions must be met before we can apply our model to a dataset:

  1. The y value must have an approximately normal distribution for each value of x.
  2. The y value must have a constant amount of spread (standard deviation) for each value of x.

It is important to check that these conditions are truly met. In effect we are trying to detect anomalies in our dataset, i.e. outliers.

Residuals – What are they?

Residuals allow us to see how far off our predictions are from the actual data that came in.

To check to see whether the y values come from a normal distribution we need to measure how far the predictions are from the actual data.

A residual, also referred to as an error, is the difference between the observed value and the value predicted by the best fit line.

If the residual value is large, the point does not fit well to the line and will appear some distance from it. If the residual value is small, the point will appear close to the line.

Let’s look at an example

Create a dataset

data1 <- c(48.5, 54.5, 61.25, 69, 74.5, 85, 89, 99, 112, 123, 134, 142)

data2 <- c(8, 9.44, 10.08, 11.81, 12.28, 13.61, 15.13, 15.47, 17.36, 18.07, 20.79, 16.06)

Plot the data to visualize

plot( data1, data2)


Create and view a linear model

fit1 <- lm( data2 ~ data1 )



Call:
lm(formula = data2 ~ data1)

Coefficients:
(Intercept)        data1
     3.6938       0.1134

Let’s look at the correlation value

cor(data1, data2)


The correlation value between the two data sets is showing a strong positive relationship

We will create a regression line to visualise this further
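In R the regression line is drawn by plotting the data and then overlaying the fitted model with abline. The vectors and model are redefined here so the snippet runs on its own:

```r
data1 <- c(48.5, 54.5, 61.25, 69, 74.5, 85, 89, 99, 112, 123, 134, 142)
data2 <- c(8, 9.44, 10.08, 11.81, 12.28, 13.61, 15.13, 15.47, 17.36, 18.07, 20.79, 16.06)
fit1 <- lm(data2 ~ data1)

plot(data1, data2)
abline(fit1)  # overlays the best fit line using fit1's intercept and slope
```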



From the graph we can see that most data points fit closely to the line, however we can see a couple of data points with more distance from the line. Possible outliers?

To investigate this further we will now look at our residuals

First, create residual values for our model fit1

residuals(fit1)
Next, make them easier to see, reducing output to 2 decimal places

res <- signif(residuals(fit1), 2)



We can see from the output that the residuals range from -3.70 to 1.90.
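As a quick sanity check of the definition above (observed minus predicted), the residuals can be recomputed by hand; the vectors are redefined so the snippet is self-contained:

```r
data1 <- c(48.5, 54.5, 61.25, 69, 74.5, 85, 89, 99, 112, 123, 134, 142)
data2 <- c(8, 9.44, 10.08, 11.81, 12.28, 13.61, 15.13, 15.47, 17.36, 18.07, 20.79, 16.06)
fit1 <- lm(data2 ~ data1)

# residual = observed value - value predicted by the best fit line
by_hand <- data2 - fitted(fit1)
all.equal(unname(residuals(fit1)), unname(by_hand))  # TRUE
```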

To put the residual values into perspective

We will standardize the values and visualize them, using various methods

fit1.stdRes <- rstandard(fit1)

Using a normality plot

qqnorm(fit1.stdRes, ylab="Standardized Residuals", xlab="Normal Scores", main="Normality Plot", col="red")



We can see that most data points are close to the line, however one in particular is quite far.

Using an abline

plot(fit1.stdRes, col="red")
abline(0, 0)


Using standardized residuals we can really get a good understanding of how far the data points are from the line.

Using a Histogram

hist(fit1.stdRes, col="red")


Our histogram is indicating that our data do not follow a normal distribution.


In our initial observation of our data set, we suspected a possible outlier.

Using this example we have concluded that there is a definite need to investigate one of the data points, as it does not fit well to our best fit line.

Outliers can dramatically affect models; often they are simply removed from a dataset and deemed an incorrect value. However, before removing an outlier, bear in mind that an outlier might be a surprising piece of information, a golden nugget, that gives new insight into your analysis.



Looking at relationships and making predictions

Looking at relationships and making predictions is one of the main tasks of data analysis.

Using correlation and simple linear regression we can look at the relationship of two variables.

Using the following dataset we can demonstrate this process.

The dataset consists of two quantitative variables, the number of cricket chirps and the temperature. There are 8 observations.

Using R we will first create our two datasets:

  • chirp_nums <- c( 18, 20, 21, 23, 27, 30, 34, 39)
  • temp_f <- c( 57, 60, 64, 65, 68, 71, 74, 77)

As we learned from Anscombe, let's visualise our data; let's use a scatterplot:

  • plot( temp_f, chirp_nums)

Visualising allows us to see the shape and dispersion of the data.

The scatterplot for our dataset shows the data moving in an uphill pattern from left to right; this suggests a positive correlation.


If the data is moving in a downhill pattern from left to right, this suggests a negative correlation.

If the data doesn't seem to show any kind of pattern, then no relationship exists between the two variables.

We can use R to calculate the correlation between chirps and temperature:

  • cor( temp_f, chirp_nums)

output:  r = 0.9774791

This correlation value for the dataset also indicates that there is a strong positive correlation between these two variables.

Understanding Correlation

Correlation is a statistical technique used to determine to what degree two variables are related.

The symbol r is used to denote Pearson's correlation coefficient; r denotes the strength of the association or relationship.

If the sign is +, the relationship is positive, so an increase in one variable is associated with an increase in the other variable and a decrease in one variable is associated with a decrease in the other variable.

If the sign is -, the relationship is negative, so an increase in one variable is associated with a decrease in the other variable.


It is important to note that positive correlation implies that there is a relationship between the two quantitative variables; however, it does not imply causation, i.e. one variable does not cause the change in the other. So in this example the number of chirps does not cause the temperature to change.

Let's create a linear model, called fit, then let's view fit:

  • fit <- lm( chirp_nums ~ temp_f)
  • fit

The output gives us our coefficients, which are used to create our regression line.


Coefficients:
(Intercept)       temp_f
    -44.177        1.055

Regression puts the emphasis on predicting one variable from another variable, e.g. if the temperature is 70 degrees, how many chirps will there be?

Let's visualise our regression line: first plot our graph, then add our line from fit:

  • plot( temp_f, chirp_nums)
  • abline(fit)

Look at the data points and how close they are to the line; another indication of a strong positive relationship.


Remember back to school: the equation of a line is y = mx + c, where m is the slope and c is the y-intercept.

The slope represents the change in y that is associated with a unit change in x.

In our output above, our slope value is 1.055 and our y-intercept is -44.177. If we substitute our values into our equation we can now start predicting outcomes: y = 1.055(x) - 44.177.

E.g. if the temperature is 70, how many chirps? y = 1.055(70) - 44.177 = 29.673, rounded to 30 chirps.

If in doubt, look at the graph, when the temperature is 70 does chirps = 30?
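Rather than substituting values by hand, R's predict() function performs the same calculation directly from the fitted model. A self-contained sketch using the chirp data from above:

```r
chirp_nums <- c(18, 20, 21, 23, 27, 30, 34, 39)
temp_f <- c(57, 60, 64, 65, 68, 71, 74, 77)
fit <- lm(chirp_nums ~ temp_f)

# predicted number of chirps at 70 degrees
predict(fit, newdata = data.frame(temp_f = 70))  # ~29.7, rounded to 30
```

Note that the new data must be supplied as a data frame whose column name matches the predictor used in the model formula.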

Anscombe’s Quartet


So as the saying goes…

“A Picture Paints a Thousand Words”

But not for everyone……!!!

I was talking to a couple of business people recently about data and the great insights that can be gained using data to solve business problems. During our conversation I asked them what tools they use to visualize their data; "we don't, we don't like graphs etc., we like numbers" was their response. Needless to say, being a student of data analytics, I was horrified and before I knew it I had turned into my lecturer and started to give them a lesson on Anscombe's Quartet.

So going back to "a picture paints a thousand words", there is a lesson to be had here: it is not immediately true for everybody. However, with the wisdom of Anscombe's Quartet, it is possible to initiate a mind-set change in people.

Anscombe’s Quartet

The statistician Francis Anscombe constructed the Anscombe dataset in 1973.

Anscombe created the dataset to demonstrate the importance of visualizing data and also to highlight the effect that outliers can have on the statistical findings of a dataset.

Anscombe's Quartet consists of four datasets that, when examined, have nearly identical statistical properties, yet when graphed the datasets tell a very different story.

Anscombe’s Dataset

Each of the datasets in the quartet consists of 11 (x,y) points:


Statistical Properties

Each dataset in the quartet has the following statistical properties:


Pearson’s Correlation Co-efficient

In our examples we have used Pearson's correlation coefficient, which measures the strength of the linear correlation between two variables, in our case x and y, of quantitative type. The output is measured between +1 and -1, where +1 is a total positive relationship, 0 denotes no relationship and -1 is a total negative relationship.

Using this method, the closer the coefficient, r, is to +1 or -1, the stronger the association of the two variables.

With Pearson's we need to err on the side of caution, as it assumes a normal distribution of the sample and it requires action to be taken with outliers. Other models may need to be applied where these assumptions are not met. This reinforces the need to visualise your data.

Let’s Visualise the Data

Anscombe 1 – This graph shows a simple linear positive relationship. It is what we would expect to see, assuming a normal distribution.


Anscombe 2 – This graph does not appear to be normally distributed. We can however see a relationship between the two variables; it appears to be quadratic or parabolic, but it is not linear.



Anscombe 3 – This graph shows a clear outlier in the dataset. The data points, with the exception of the outlier, show what appears to be a perfect linear relationship, but because of the outlier the value of the correlation coefficient has been reduced from 1 to 0.816.



Anscombe 4 – In this graph we can see that the value of x stays constant with the exception of one outlier. This outlier has created the same correlation coefficient as the other datasets, which is a high correlation; however, the relationship between the two variables is not linear.



In my wisdom I misunderstood, thinking that coming up with my own dataset meant creating a fifth dataset with the same statistical values as Anscombe's.

I was relieved, however, to hear that coming up with my own dataset meant coming up with four sets of data points that give the same statistical outcomes (mean, variance, correlation and linear regression line) as Anscombe. This was a far simpler task and just meant using logic such as doubling each data point. That almost seemed too simple, so I tried adding 1 to each data point in Anscombe; both ways worked, as follows. I might note that dataset 4 in both instances has a slightly different correlation of 0.817, instead of 0.816, because of rounding, but it is not significant.
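Both tricks are easy to verify in R, which ships Anscombe's quartet as the built-in data frame anscombe:

```r
data(anscombe)

r_orig    <- cor(anscombe$x1, anscombe$y1)
r_shifted <- cor(anscombe$x1 + 1, anscombe$y1 + 1)  # add 1 to every point
r_doubled <- cor(anscombe$x1 * 2, anscombe$y1 * 2)  # double every point

# correlation is unchanged by shifting or scaling: all ~0.816
c(r_orig, r_shifted, r_doubled)
```

This works because correlation is computed from standardized deviations around the mean, so adding a constant or multiplying by a positive constant leaves r unchanged.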

Anscombe’s Dataset Data Points + 1








Here we go….

My plan was to use the equation of the line to generate y points when x was given, which would give an exact positive linear correlation of 1.




So then I would have to adjust some y values to get the correct y variance and aim for a correlation of 0.816.
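The first step of that plan can be sketched in R. Anscombe's fitted line is y = 3 + 0.5x, and points generated exactly on it give a correlation of exactly 1, which then has to be nudged down towards 0.816 by perturbing the y values:

```r
# Anscombe's x values (sets 1-3 share them) and his fitted line y = 3 + 0.5x
x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y_exact <- 3 + 0.5 * x   # every point sits exactly on the line

cor(x, y_exact)  # 1: a perfect positive linear relationship
# adjusting some y values would lower this towards the target 0.816
```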


I spent way too much time on this; like a dog with a bone, I tried many combinations.


Given more time I expect I would have finally got there… but how long it would have taken, we will never know!