A K-Means and KNN Investigation

Abstract

In computer science, machine learning evolved from the study of pattern recognition and computational learning theory in artificial intelligence.

The aim of machine learning is to construct models, using algorithms that can learn from a dataset, to enable predictions to be made on unseen data.

K-Means aims to cluster data into types or groups. It is an unsupervised learning method: its purpose is to discover the structure of the inputs and uncover hidden patterns in the data.

K-Nearest Neighbours is a supervised learning method, as it uses a given type, often called a label, from the dataset. KNN learns from the labelled data and uses the labels to make predictions on unseen data.

Introduction

The task is to find a dataset suitable for K-means cluster analysis and K-Nearest Neighbour predictions.

The data set I have chosen is weather recordings at Heathrow (London Airport).

The question I would like to answer is:

“Are there really four seasons?”

NB: This is a UK weather dataset

Data Source:

Data Set Information:

It is a real, observed data set.

The data set comes from https://data.gov.uk/dataset/historic-monthly-meteorological-station-data/resource/16d65b07-aade-4585-b812-3ec0d05434d6

Heathrow (London Airport) Location 507800E 176700N, Lat 51.479 Lon -0.449, 25m amsl

Data includes:

By month:

  • Mean maximum temperature (tmax)
  • Mean minimum temperature (tmin)
  • Days of air frost (af)
  • Total rainfall (rain)
  • Total sunshine duration (sun)

Data

Visualise, explore and clean data

Data Issues to be addressed:

  • Missing data (more than 2 days missing in a month) is marked by “—”.
  • Data runs from Jan 1948 to Jan 2016.
  • Estimated data is marked with “*”.
  • Sunshine data taken from an automatic Kipp & Zonen sensor is marked with “#”; otherwise the data is taken from a Campbell Stokes recorder.

The original dataset contains 7 variables and 818 observations.

The final dataset contains 7 variables and 708 observations (59 years of data).

Tools used: R and Excel.
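The cleaning itself was done in Excel, but the equivalent steps in R might look like the sketch below; the raw file name and the idea of stripping the “*” and “#” markers before numeric conversion are assumptions based on the Met Office format described above.

raw <- read.csv("heathrowdata_raw.csv", stringsAsFactors = FALSE) # hypothetical raw export
raw$sun <- as.numeric(gsub("[#*]", "", raw$sun)) # strip "#" (Kipp & Zonen) and "*" (estimated) markers
raw$af <- as.numeric(gsub("\\*", "", raw$af)) # "---" entries become NA on conversion
clean <- raw[raw$yyyy >= 1957 & raw$yyyy <= 2015, ] # keep complete years only (drops 1948-1956 and 2016)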

Variable  | Missing values | Comments before cleaning | Comments after cleaning
yyyy      | No | Years 1948-1956 have af days and sun hours recorded as “—”: delete. Year 2016 has only 1 month: remove to keep complete years. | Years 1948-1956 and year 2016 deleted.
mm        | No | Months listed in numerical order: no issues. |
tmax degC | No | No issues. |
tmin degC | No | No issues. |
af days   | No | Data for year 1948 listed as “—”: delete. 48 (0.05%) values have strange data (spaces between numbers), which looks like a data-entry error. | “—” deleted in Excel. Small percentage, so corrected input using best judgement.
rain mm   | No | No issues. |
sun hours | No | Data for years 1948-1956 listed as “—”: delete. Some values have “#”: need to be cleaned. | Years 1948-1956 deleted; “#” removed.
type      | No | New column added to enable clusters to be tested. |

K-Means Clustering – Overview of Process

Reference Tutorial

  • http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/

1. Read in data

2. Install libraries

3. Visualise the data

[Figure: kmeans1]

4. As the ranges of the values vary, standardise them (taking out columns 1 and 2)

[Figure: kmeans2]

5. Create a function that runs k-means over a range of cluster counts, plots the within-groups sum of squares, and run it

6. Output suggests 2 clusters

[Figure: kmeans3]

[Figure: kmeans4]

7. In our dataset the data is distinguished by month (1 for January, 2 for February, etc.), giving 12 types, which does not suit 2 clusters. I had suggested seasons, which would give 4 clusters, but that is not suitable either, so:

8. I used a common-sense approach to assigning type values 1 and 2 to the dataset, as the data lends itself nicely to this: I assigned type 1 to Nov, Dec, Jan, Feb and Mar, and type 2 to Apr, May, Jun, Jul, Aug, Sep and Oct.

However, if you have no idea about the data or how to distinguish the types, you can use the cluster aggregate output to create a type attribute in the dataset and assign a cluster value, as sketched after the figure below:

[Figure: kmeans5]
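If the types are unknown in advance, the cluster assignments themselves can be written back into the data as a provisional type attribute. A minimal sketch, assuming fit.km is the k-means fit created in the full code at the end of this post:

data$type <- fit.km$cluster # each row gets its cluster number as a provisional type
aggregate(data[-1:-2], by=list(cluster=fit.km$cluster), mean) # per-cluster means help interpret the types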

9. Test the fit of your clusters to see how accurate they are.

The output table below shows that type 1 has 290 correct and 5 wrong (cluster 1), and type 2 has 391 correct and 22 wrong (cluster 2). This seems like a pretty good model.

[Figure: kmeans6]

10. Re-confirm the findings by running randIndex, which returns the adjusted Rand index, a measure of agreement between the clusters and the labels. Here it is about 0.85, suggesting a good fit.

[Figure: kmeans7]

K-Means Clustering – Summary

To summarise the findings for this data set according to K-means, we have 2 clusters, so in answer to my question:

“Are there really four seasons?”

No, we have 2 seasons:

  • Season 1 – Nov, Dec, Jan, Feb and Mar
  • Season 2 – Apr, May, Jun, Jul, Aug, Sep and Oct

 

K-Nearest Neighbour – Overview of Process

Reference tutorial 

  • https://www.datacamp.com/community/tutorials/machine-learning-in-r

1. Read in data

2. Install libraries

3. Visualise the data

[Figure: knn1]

4. Normalise the data (type 1 is now type 0 and type 2 is now type 1):

[Figure: knn2]
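Min-max normalisation rescales each variable to the range [0, 1], so that variables with large ranges (such as sun hours) don't dominate the distance calculation. In short, with the full function given in the code at the end of this post:

normalize <- function(x) (x - min(x)) / (max(x) - min(x))
normalize(c(2, 5, 8)) # returns 0.0 0.5 1.0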

5. Create a training and a test data set with a 2/3, 1/3 split

6. Check the output on the test data set for effectiveness

[Figure: knn3]

7. This model shows only 6 errors per type, with 94% and 96% accuracy for each type respectively. This suggests that this is a good model.
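For reference, the per-type accuracy can be read off the cross table by dividing each type's correct predictions by its row total. A minimal sketch, assuming the data.testLabels and data_pred objects from the full code at the end of this post:

cm <- table(true = data.testLabels, predicted = data_pred)
diag(cm) / rowSums(cm) # proportion correct for each type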

K-Nearest Neighbour – Summary

To summarise the findings of this data set according to K-Nearest Neighbour: we tested the 2 cluster types using a training and a test data set. The model is a good fit, so we can conclude:

  • Season 1 – Nov, Dec, Jan, Feb and Mar,
  • Season 2 – Apr, May, Jun, Jul, Aug, Sep and Oct

CONCLUSION

I posed the question “Are there really four seasons?” and used a dataset containing 59 years of monthly averages for several weather variables.

The tests that I ran suggest that in fact we have 2 seasons:

Season 1 – Nov, Dec, Jan, Feb and Mar

Season 2 – Apr, May, Jun, Jul, Aug, Sep and Oct

In reality, yes, it does feel like we no longer have 4 distinct seasons: one season rolls into the next, with the winter months being the most distinguishable from the rest.

However, I tested these models using all available variables. Given more time, I would test them with different combinations of variables, e.g. just temperature, to see how they affect the resulting clusters.

R Code and comments:

K-Means Clustering

#data <- read.csv("heathrowdata1Type.csv", header = TRUE)
data <- read.csv("heathrowdata1Type_reverse.csv", header = TRUE)

# Get a feel for the data
head(data)
str(data)
summary(data)

wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}

# Standardized unit – data values have varying ranges
# also taking out column 1 mm and type
df <- scale(data[-1:-2])
head(df)

# determine number of clusters
wssplot(df)
# each kink in the graph marks a candidate number of clusters, up to 15
# the best number of clusters is at the elbow of the graph
# identifying 2 as the best number of clusters

# NbClust – has the clustering algorithms
install.packages("NbClust")
library("NbClust")
# setting random number generator
set.seed(1234)

# request between 2 and 15 clusters, using the kmeans algorithm to determine the best number
nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")
# two graphs are shown; the output states: "According to the majority rule, the best number of clusters is 2"

table(nc$Best.n[1,])

# 2 clusters
# Visualise bar chart
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria")

# set.seed allows you to set the seed for the random number generator
set.seed(1234)
fit.km <- kmeans(df, 2, nstart=25) # nstart=25 tries 25 random starting configurations and keeps the best solution
fit.km$size # outputs the 2 cluster sizes, 312 and 396, suggesting one cluster takes in more months!

fit.km$centers # prints the centre point of each cluster; looking at the centres
# we can see a definite difference between the clusters.

# show the mean of each variable in each cluster
aggregate(data[-1:-2], by=list(cluster=fit.km$cluster), mean)

# show table of correct and incorrect count
ct.km <- table(data$type, fit.km$cluster)
ct.km

# OUTPUTS 1 and 2 (heathrowdata1Type.csv) look the wrong way around: change the types in the dataset to the opposite assignment
# OUTPUT 1: – when type 2 – jan, feb, nov, dec, type 1 the rest
#   1   2
#1  76 394
#2 236   2

# OUTPUT 2:  – when type 2 – jan, feb, mar, nov, dec, type 1 the rest.
#   1   2
#1  22 389
#2 290   7

# AFTER TYPE CHANGED AROUND (heathrowdata1Type_reverse.csv) – table reads correct now.
# OUTPUT 3: – when type 1 – jan, feb, mar, nov, dec and type 2 the rest
#   1   2
#1 290   5
#2  22 391

install.packages("flexclust")
library("flexclust")
#
randIndex(ct.km)
# OUTPUT 1: ARI – 0.6067875 – quite good fit??
# OUTPUT 2: ARI – 0.842598 – good fit.
# OUTPUT 3: ARI – 0.8530182 – good fit
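# Note: randIndex() from flexclust returns the adjusted Rand index by default;
# it corrects for chance agreement (0 = no better than random, 1 = perfect),
# so ~0.85 indicates strong agreement between the clusters and the type labels.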

KNN – K-Nearest Neighbours

data <- read.csv("heathrowdata1_knn.csv", header = TRUE)
head(data)

install.packages("class")
library("class")

# This is the normalisation function
normalize <- function(x) {
  num <- x - min(x)
  denom <- max(x) - min(x)
  return (num/denom)
}

# create dataframe of normalised data
data_norm <- as.data.frame(lapply(data[1:6], normalize))
head(data_norm)
summary(data_norm)

#Training And Test Sets

#To make your training and test sets, you first set a seed.
#This seeds R's random number generator.
#The major advantage of setting a seed is that you can get the same sequence of random numbers
#whenever you supply the same seed in the random number generator.
set.seed(1234)

#Then, you want to make sure that your data set is shuffled and that you have the same ratio between types in your training and test sets. You use the sample() function to take a sample with a size that is set as the number of rows of the data set.
#You sample with replacement: you choose from a vector of 2 elements and assign either 1 or 2 to the rows of the data set.
#The assignment of the elements is subject to probability weights of 0.67 and 0.33.
ind <- sample(2, nrow(data_norm), replace=TRUE, prob=c(0.67, 0.33))

#ind <- sample(2, nrow(data), replace=TRUE, prob=c(0.67, 0.33))
# just seeing what happens if I don’t normalise data
# I want training set to be 2/3 of the original data set:
#assign 1 with a probability of 0.67 and the other with a probability of 0.33 to the sample rows.

#You can then use the sample that is stored in the variable ind to define your training and test sets:
data.training <- data_norm[ind==1, 1:5]
data.test <- data_norm[ind==2, 1:5]

#data.training <- data[ind==1, 1:5]
#data.test <- data[ind==2, 1:5]
# just seeing what happens if I don’t normalise data

#Note that, in addition to the 2/3 and 1/3 proportions specified above, you don’t take into account all attributes to form the training and test sets.
#This is because you actually want to predict the sixth attribute, type: it is your target variable.
#However, you do want to include it in the KNN algorithm; otherwise there will never be any prediction for it.
#You therefore need to store the class labels in factor vectors and divide them over the training and test sets.
# store the class labels (column 6) for the training and test sets
data.trainLabels <- data_norm[ind==1, 6]
data.testLabels <- data_norm[ind==2, 6]

#data.trainLabels <- data[ind==1, 6]
#data.testLabels <- data[ind==2, 6]
# just seeing what happens if I don’t normalise data
#The Actual KNN Model
#Building Your Classifier
#After all these preparation steps, you have made sure that all your known (training) data is stored.
#No actual model or learning was performed up until this moment. Now, you want to find the k nearest neighbors of your training set.
#An easy way to do these two steps is by using the knn() function,
#which uses the Euclidean distance measure to find the k nearest neighbours to your new, unknown instance. Here, the k parameter is one that you set yourself.
#As mentioned before, new instances are classified by majority vote or weighted vote.
#In case of classification, the data point with the highest score wins
#and the unknown instance receives the label of that winning data point.
#If there is a tie among the winners, the classification happens randomly.

#Note the k parameter is often an odd number to avoid ties in the voting scores.
#To build your classifier, you need to take the knn() function and simply add some arguments to it,
#just like in this example:

data_pred <- knn(train = data.training, test = data.test, cl = data.trainLabels, k=3)
#The result of this command is the factor vector with the predicted classes for each row of the test data.

#Evaluation of Your Model

data_pred
data.testLabels
#OUTPUT: 12 incorrect
#You see that the model makes reasonably accurate predictions.
install.packages("gmodels")
library("gmodels")

#Then you can make a cross tabulation or a contingency table.
#This type of table is often used to understand the relationship between two variables.
#In this case, you want to understand how the classes of your test data, stored in data.testLabels relate to your model that is stored in data_pred:
CrossTable(x = data.testLabels, y = data_pred, prop.chisq=FALSE)

#Note that the last argument prop.chisq indicates whether or not the chi-square contribution of each cell is included.
#The chi-square statistic is the sum of the contributions from each of the individual cells and is used to decide whether the difference between the observed and the expected values is significant.

#From this table, you can derive the number of correct and incorrect predictions: 6 instances of 0 predicted as 1, and 6 instances of 1 predicted as 0.
#In all other cases, correct predictions were made. You can conclude that the model's performance is good enough and that you don't need to improve the model!

# NB: the data that was not normalised didn't perform too badly, but the normalised data performed better.

References:

  • http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
  • https://www.datacamp.com/community/tutorials/machine-learning-in-r
  • https://en.wikipedia.org/wiki/Machine_learning
