Unsupervised methods typically require more effort to interpret results. No response variable exists in the data, so we do not know exactly what the analysis will uncover.
Clustering is a familiar unsupervised method that groups data into clusters based on the features the analyst selects and the variance of those features. Clustering requires that features measured in different units be normalized (as sketched below) and has a few tuning parameters:
The k in K-means clustering represents the number of expected clusters. We define k before the analysis.
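As noted above, features measured in different units should be put on a common scale before clustering so that no single feature dominates the distance calculation. A minimal sketch, using a hypothetical data frame whose columns are in centimeters and dollars:
# hypothetical data frame with columns in very different units
df <- data.frame(height = rnorm(50, mean = 170, sd = 10),
                 income = rnorm(50, mean = 50000, sd = 15000))
df_scaled <- scale(df)                  # center each column and divide by its standard deviation
km <- kmeans(df_scaled, centers = 2)    # k = 2 is chosen before the analysis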
Create some random data: a matrix with 100 rows and two columns.
set.seed(101)                        # for reproducibility
x <- matrix(rnorm(100*2), 100, 2)    # 100 rows, 2 columns of standard normal draws
head(x)
[,1] [,2]
[1,] -0.3260365 0.2680658
[2,] 0.5524619 -0.5922083
[3,] -0.6749438 2.1334864
[4,] 0.2143595 1.1727487
[5,] 0.3107692 0.7467610
[6,] 1.1739663 -0.2305087
x_mean <- matrix(rnorm(8, sd=4), 4, 2)       # four random cluster centers, one per row
which <- sample(1:4, 100, replace = TRUE)    # assign each point to one of the four centers
x <- x + x_mean[which,]                      # shift each point by its assigned center
Plot the randomly shifted points. The col argument colors the points by the value of which. The which variable is a random sample of the numbers one through four, of length 100, assigning each row to one of the four cluster centers.
plot(x, col=which, pch=19)
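Because which is drawn at random, the four groups will not be exactly the same size; a quick check of the counts (output not shown):
table(which)   # how many points were assigned to each of the four centers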
Use K-means to find the clusters (we already know the clusters). The nstart=15 argument below runs the algorithm from 15 random starting configurations and keeps the best fit.
km.out <- kmeans(x, 4, nstart=15)
km.out
K-means clustering with 4 clusters of sizes 21, 30, 32, 17
Cluster means:
[,1] [,2]
1 -3.1068542 1.1213302
2 1.7226318 -0.2584919
3 -5.5818142 3.3684991
4 -0.6148368 4.8861032
Clustering vector:
[1] 2 3 3 4 1 1 4 3 2 3 2 1 1 3 1 1 2 3 3 2 2 3 1 3 1 1 2 2 3 1
[31] 1 4 3 1 3 3 1 2 2 3 2 2 3 3 1 3 1 3 4 2 1 2 2 4 3 3 2 2 3 2
[61] 1 2 3 4 2 4 3 4 4 2 2 4 3 2 3 4 4 2 2 1 2 4 4 3 3 2 3 3 1 2
[91] 3 2 4 4 4 2 3 3 1 1
Within cluster sum of squares by cluster:
[1] 30.82790 54.48008 71.98228 21.04952
(between_SS / total_SS = 87.6 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
between_SS / total_SS = the percent of variation explained by the clustering. In this case it is 87.6%.
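The same ratio can be recomputed from the components listed above:
km.out$betweenss / km.out$totss   # between-cluster SS as a fraction of total SS (about 0.876 here)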
Plot the results of the kmeans function, overlaying the original assignments with a different marker so the two groupings can be compared.
plot(x, col=km.out$cluster, cex=2, pch=1, lwd=2)   # open circles colored by kmeans cluster
points(x, col=c(4,3,2,1)[which], pch=19)           # filled points colored by the original assignment (colors reordered)
We see the kmeans function was able to detect our manually created clusters, mislabeling only two data points.
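A cross-tabulation of the kmeans labels against the true assignments confirms this; since the kmeans cluster numbers are arbitrary, agreement shows up as one dominant count per row (output not shown):
table(km.out$cluster, which)   # rows are kmeans clusters, columns are the true assignments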
Use the same data but with the hierarchical clustering method. We do not provide a k or number of clusters beforehand. This method groups observations based on a distance measure, merging small groups into larger groups until all the data is in one group. The groupings are plotted in a dendrogram, and we then select the number of groupings to interpret.
We use complete as the linkage method.
hc.complete <- hclust(dist(x), method="complete")
plot(hc.complete)
An analyst would select the cutoff height from which to interpret groupings.
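For example, a horizontal line drawn at a candidate height shows where the tree would be cut, and cutree() can cut by height (h) instead of by a number of groups; the height of 6 below is purely illustrative:
plot(hc.complete)
abline(h = 6, col = "red")             # illustrative cutoff height, not a recommendation
groups <- cutree(hc.complete, h = 6)   # group memberships implied by that height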
For comparison, repeat the clustering using single linkage.
hc.single <- hclust(dist(x), method="single")
plot(hc.single)
Cut the complete-linkage tree into four groups and compare the result with our manually generated clusters to see if the function found them.
hc.cut <- cutree(hc.complete, 4)
table(hc.cut,which)
which
hc.cut 1 2 3 4
1 0 0 30 0
2 1 31 0 2
3 17 0 0 0
4 0 0 0 19
The smaller counts, off the dominant cell of each row, indicate misclassifications. There are three misclassifications (1 + 2).
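The same count can be computed from the table by treating the largest entry in each row as the correctly grouped points:
tab <- table(hc.cut, which)
sum(tab) - sum(apply(tab, 1, max))   # total points minus the per-row majorities, giving 3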
Compare the same four-group cut with our kmeans clusters to see how closely the two methods agree.
table(hc.cut, km.out$cluster)
hc.cut 1 2 3 4
1 0 30 0 0
2 2 0 32 0
3 0 0 0 17
4 19 0 0 0
In this case there were two misclassifications.
Finally, replot the complete-linkage dendrogram with the leaves labeled by the true assignments in which.
plot(hc.complete, labels=which)
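rect.hclust() can also outline the four groups directly on the dendrogram, which makes the chosen cut easier to see:
plot(hc.complete, labels = which)
rect.hclust(hc.complete, k = 4, border = 2:5)   # draw a box around each of the four clusters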