Abstract
In computer science, Machine Learning evolved from the study of pattern recognition and computational learning theory in artificial intelligence.
The aim of machine learning is to construct models, using algorithms that can learn from a dataset, to enable predictions to be made on unseen data.
K-Means aims to cluster data into groups. It is an unsupervised learning method: its purpose is to discover the structure of the inputs and find hidden patterns in the data.
K-Nearest Neighbours is a supervised learning method, as it uses a given type, often called a label, from the dataset. KNN learns from the dataset and uses the labels to make predictions on unseen data.
Introduction
The task is to find a dataset suitable for K-Means cluster analysis and K-Nearest Neighbour predictions.
The data set I have chosen is weather recordings at Heathrow (London Airport).
The question I would like to answer is:
“Are there really four seasons?”
NB: This is a UK weather dataset
Data Source:
Data Set Information:
It is a real data set.
The data set comes from https://data.gov.uk/dataset/historic-monthly-meteorological-station-data/resource/16d65b07-aade-4585-b812-3ec0d05434d6
Heathrow (London Airport) Location 507800E 176700N, Lat 51.479 Lon 0.449, 25m amsl
Data includes:
By month:
 Mean maximum temperature (tmax)
 Mean minimum temperature (tmin)
 Days of air frost (af)
 Total rainfall (rain)
 Total sunshine duration (sun)
Data
Visualise, explore and clean data
Data Issues to be addressed:
 Missing data (more than 2 days missing in a month) is marked by —.
 Data runs from Jan 1948 – Jan 2016
 Estimated Data is marked with *
 Sunshine data taken from an automatic Kipp & Zonen sensor is marked with #, otherwise the data is taken from a Campbell Stokes recorder.
The original dataset contains 7 variables and 818 observations.
The final dataset contains 7 variables and 708 observations (59 years of data).
Tools used: R and Excel.
Variable | Missing values | Comments – before cleaning | Comments – after cleaning
---------|----------------|----------------------------|--------------------------
yyyy | No | Years 1948–1956 have af days and sun hours recorded as “—”: delete. Year 2016 has only 1 month – remove to keep complete years | Years 1948–1956 deleted
mm | No | Months listed in numerical order: no issues |
tmax degC | No | No issues |
tmin degC | No | No issues |
af days | No | Data for year 1948 listed as “—”: delete. 48 (0.05%) values have strange data (spaces between numbers) – looks like a data entry error | “—” deleted in Excel; small %, so corrected input using best judgement
rain mm | No | No issues |
sun hours | No | Data for years 1948–1956 listed as “—”: delete. Some values have #: need to be cleaned | Years 1948–1956 deleted; # removed
type | No | New column added | To enable clusters to be tested
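The Excel steps above could equally be scripted. A minimal sketch in R, assuming a hypothetical raw file name (`heathrowdata.txt`) and the column layout described above:

```r
# Hypothetical cleaning script – the file name and column names are assumptions.
# "---" marks a month with more than 2 days missing; read it as NA.
raw <- read.table("heathrowdata.txt", header = TRUE,
                  na.strings = "---", stringsAsFactors = FALSE)

# Strip the "*" (estimated) and "#" (Kipp & Zonen sensor) flags from the
# sunshine column, then convert back to numbers
raw$sun <- as.numeric(gsub("[#*]", "", raw$sun))

# Keep only complete rows – this drops 1948-1956 and any other gaps
clean <- raw[complete.cases(raw), ]
```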
K-Means Clustering – Overview of Process
Reference Tutorial
 http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
1. Read in data
2. Install libraries
3. Visualise the data
4. As the range of the values varies, standardise the values (taking out columns 1 and 2).
5. Create the function that builds the clusters, and run it.
6. Output suggests 2 clusters
7. In our dataset the data is distinguished by month (1 for January, 2 for February, and so on), so there are 12 types, which do not suit 2 clusters. I had suggested seasons, which would give 4 clusters, but this is not suitable either, so:
8. I used a common-sense approach to assigning a type value of 1 or 2, as the data lends itself nicely to this: I assigned type 1 to Nov, Dec, Jan, Feb and Mar, and type 2 to Apr, May, Jun, Jul, Aug, Sep and Oct.
However, if you have no idea about the data or how to distinguish the types, you can use the cluster aggregate output to create a type attribute in the dataset and assign a cluster value:
9. Test the fit of your clusters to see how accurate they are.
The output table below shows that cluster 1, type 1 has 290 correct and 5 wrong, and cluster 2, type 2 has 391 correct and 22 wrong. That seems like a pretty good model.
10. Reconfirm your findings by running randIndex, which gives an adjusted Rand index between -1 and 1. Here it is about 0.85, which is a good fit.
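The fit checks in steps 9 and 10 can be reproduced by hand from the confusion table. A short sketch using the counts from Output 3 in the code section (randIndex comes from the flexclust package):

```r
library(flexclust)

# cluster-vs-type counts from Output 3: 290 + 391 correct, 22 + 5 wrong
ct <- matrix(c(290, 22, 5, 391), nrow = 2,
             dimnames = list(type = 1:2, cluster = 1:2))

sum(diag(ct)) / sum(ct)   # simple accuracy: 681/708, about 0.96
randIndex(as.table(ct))   # adjusted Rand index, about 0.85
```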
K-Means Clustering – Summary
To summarise the findings for this data set according to K-Means, we have 2 clusters, so in answer to my question:
“Are there really four seasons?”
No, we have 2 seasons:
 Season 1 – Nov, Dec, Jan, Feb and Mar
 Season 2 – Apr, May, Jun, Jul, Aug, Sep and Oct
K-Nearest Neighbour – Overview of Process
Reference tutorial
 https://www.datacamp.com/community/tutorials/machine-learning-in-r
1. Read in data
2. Install libraries
3. Visualise the data
4. Normalise the data (type 1 becomes type 0 and type 2 becomes type 1).
5. Create training and test data sets with a 2/3 : 1/3 split.
6. Check output of test data set for effectiveness
7. The model shows only 6 errors for each type, with 94% and 96% accuracy respectively. This suggests a good model.
K-Nearest Neighbour – Summary
To summarise the findings for this data set according to K-Nearest Neighbour, we tested the 2 cluster types using training and test data sets. The model is a good fit, so we can conclude:
 Season 1 – Nov, Dec, Jan, Feb and Mar
 Season 2 – Apr, May, Jun, Jul, Aug, Sep and Oct
CONCLUSION
I posed the question “Are there really four seasons?” and used a dataset containing 59 years of monthly averages for various weather variables.
The tests that I ran suggest that in fact we have 2 seasons:
Season 1 – Nov, Dec, Jan, Feb and Mar
Season 2 – Apr, May, Jun, Jul, Aug, Sep and Oct
In reality, yes, it does feel as though we no longer have 4 distinct seasons: one season rolls into the next, with the winter months being the most distinguishable from the rest.
However, I tested these models using all available variables; given more time, I would test them with different combinations of variables (e.g. temperature alone) to see how they affect the clustering.
R Code and comments:
K-Means Clustering
#data <- read.csv("heathrowdata1Type.csv", header = TRUE)
data <- read.csv("heathrowdata1Type_reverse.csv", header = TRUE)
# Get a feel for the data
head(data)
str(data)
summary(data)
wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}
# Standardise – the variables have very different ranges.
# Drop the yyyy, mm and type columns, keeping the five measurement columns
df <- scale(data[, 3:7])
head(df)
# determine number of clusters
wssplot(df)
# each kink in the graph marks a candidate number of clusters (up to 15 tried)
# best number of clusters at elbow on graph
# identifying 2 as best number of clusters
# NbClust – has the clustering algorithms
install.packages("NbClust")
library("NbClust")
# set the random number generator seed
set.seed(1234)
# ask for between 2 and 15 clusters, using the k-means algorithm to decide which to use
nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")
# two graphs are shown – the output: according to the majority rule, the best number of clusters is 2
table(nc$Best.n[1,])
# 2 clusters
# Visualise bar chart
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria")
# set.seed allows you to set the seed for the random number generator
set.seed(1234)
fit.km <- kmeans(df, 2, nstart=25) # nstart=25: run 25 random starts and keep the best solution
fit.km$size # the 2 cluster sizes, 312 and 396 – one cluster takes in more months!
fit.km$centers # prints the centre point of each cluster;
# we can see a definite difference between the clusters.
# show the mean of each measurement variable in each cluster
aggregate(data[, 3:7], by=list(cluster=fit.km$cluster), mean)
# show table of correct and incorrect count
ct.km <- table(data$type, fit.km$cluster)
ct.km
# OUTPUTS 1 and 2 (heathrowdata1Type.csv) look the wrong way around –
# so the types in the dataset were changed to the opposite way around
# OUTPUT 1: type 2 = Jan, Feb, Nov, Dec; type 1 = the rest
#       1    2
# 1    76  394
# 2   236    2
# OUTPUT 2: type 2 = Jan, Feb, Mar, Nov, Dec; type 1 = the rest
#       1    2
# 1    22  389
# 2   290    7
# AFTER THE TYPES WERE SWAPPED (heathrowdata1Type_reverse.csv) the table reads correctly:
# OUTPUT 3: type 1 = Jan, Feb, Mar, Nov, Dec; type 2 = the rest
#       1    2
# 1   290    5
# 2    22  391
install.packages("flexclust")
library("flexclust")
# adjusted Rand index – agreement between the cluster assignment and the type column
randIndex(ct.km)
# OUTPUT 1: ARI = 0.6067875 – a moderate fit
# OUTPUT 2: ARI = 0.842598 – a good fit
# OUTPUT 3: ARI = 0.8530182 – a good fit
KNN – K-Nearest Neighbours
data <- read.csv("heathrowdata1_knn.csv", header = TRUE)
head(data)
install.packages("class")
library("class")
# This is the normalisation function
normalize <- function(x) {
  num <- x - min(x)
  denom <- max(x) - min(x)
  return (num/denom)
}
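A quick sanity check of the normalisation on a toy vector (the values here are just an illustration):

```r
# min-max scaling maps the smallest value to 0 and the largest to 1
normalize(c(1, 2, 3))   # 0.0 0.5 1.0
```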
# create dataframe of normalised data
data_norm <- as.data.frame(lapply(data[1:6], normalize))
head(data_norm)
summary(data_norm)
#Training And Test Sets
#To make your training and test sets, you first set a seed.
#This is a number for R’s random number generator.
#The major advantage of setting a seed is that you can get the same sequence of random numbers
#whenever you supply the same seed in the random number generator.
set.seed(1234)
#Then, you want to make sure that your data set is shuffled and that you have the same ratio between types in your training and test sets. You use the sample() function to take a sample with a size that is set as the number of rows of the data set.
#You sample with replacement: you choose from a vector of 2 elements and assign either 1 or 2 to the rows of the data set.
#The assignment of the elements is subject to probability weights of 0.67 and 0.33.
ind <- sample(2, nrow(data_norm), replace=TRUE, prob=c(0.67, 0.33))
#ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.67, 0.33))
# just seeing what happens if I don’t normalise data
# I want training set to be 2/3 of the original data set:
#assign 1 with a probability of 0.67 and the other with a probability of 0.33 to the sample rows.
#You can then use the sample that is stored in the variable ind to define your training and test sets:
data.training <- data_norm[ind==1, 1:5]
data.test <- data_norm[ind==2, 1:5]
#data.training <- data[ind==1, 1:5]
#data.test <- data[ind==2, 1:5]
# just seeing what happens if I don’t normalise data
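To confirm that the realised split is close to the intended 2/3 : 1/3, the proportions of ind can be inspected directly:

```r
# fraction of rows assigned to the training (1) and test (2) sets;
# with prob = c(0.67, 0.33) these should come out at roughly 0.67 and 0.33
prop.table(table(ind))
```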
#Note that, in addition to the 2/3 and 1/3 proportions specified above, you don’t take into account all attributes to form the training and test sets.
#This is because you actually want to predict the sixth attribute, type: it is your target variable.
#However, you do need to pass it to the KNN algorithm as the class labels, otherwise there will never be any prediction for it.
#You therefore need to store the class labels in factor vectors and divide them over the training and test sets.
# store the class labels (the 6th column) for the training and test rows
data.trainLabels <- data_norm[ind==1, 6]
data.testLabels <- data_norm[ind==2, 6]
#data.trainLabels <- data[ind==1, 6]
#data.testLabels <- data[ind==2, 6]
# just seeing what happens if I don’t normalise data
#The Actual KNN Model
#Building Your Classifier
#After all these preparation steps, you have made sure that all your known (training) data is stored.
#No actual model or learning was performed up until this moment. Now, you want to find the k nearest neighbors of your training set.
#An easy way to do these two steps is by using the knn() function,
#which uses the Euclidean distance measure to find the k nearest neighbours to your new, unknown instance. Here, the k parameter is one that you set yourself.
#As mentioned before, new instances are classified by looking at the majority vote or weighted vote.
#In case of classification, the data point with the highest score wins the battle and the unknown
#instance receives the label of that winning data point.
#If there is an equal amount of winners, the classification happens randomly.
#Note the k parameter is often an odd number to avoid ties in the voting scores.
#To build your classifier, you need to take the knn() function and simply add some arguments to it,
#just like in this example:
data_pred <- knn(train = data.training, test = data.test, cl = data.trainLabels, k=3)
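k = 3 follows the tutorial; other odd values of k could be compared on the same split. A small sketch (reusing the data.training, data.test and label objects created above):

```r
# test-set error rate for a few odd k values – lower is better
for (k in c(1, 3, 5, 7)) {
  pred <- knn(train = data.training, test = data.test,
              cl = data.trainLabels, k = k)
  cat("k =", k, " error =", mean(pred != data.testLabels), "\n")
}
```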
#The result of this command is the factor vector with the predicted classes for each row of the test data.
#Evaluation of Your Model
data_pred
data.testLabels
#OUTPUT – 12 incorrect
#You see that the model makes reasonably accurate predictions,
install.packages("gmodels")
library("gmodels")
#Then you can make a cross tabulation or a contingency table.
#This type of table is often used to understand the relationship between two variables.
#In this case, you want to understand how the classes of your test data, stored in data.testLabels relate to your model that is stored in data_pred:
CrossTable(x = data.testLabels, y = data_pred, prop.chisq=FALSE)
#Note that the last argument, prop.chisq, indicates whether or not the chi-square contribution of each cell is included.
#The chi-square statistic is the sum of the contributions from the individual cells and is used to decide whether the difference between the observed and expected values is significant.
#From this table, you can derive the number of correct and incorrect predictions: 6 instances of 0 predicted as 1, and 6 instances of 1 predicted as 0.
#In all other cases, correct predictions were made. You can conclude that the model’s performance is good enough and that you don’t need to improve the model!
# NB: the data that was not normalised didn't perform too badly; however, the normalised data performed better.
References:
 http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
 https://www.datacamp.com/community/tutorials/machine-learning-in-r
 https://en.wikipedia.org/wiki/Machine_learning