---
title: "ME114 Day 6: Solutions for Assignment 6"
author: "Ken Benoit and Slava Mikhaylov"
output: html_document
---
### Exercise 6.1
Suppose that we have four observations, for which we compute a dissimilarity matrix, given by
$$\left[ \begin{array}{cccc}
 & 0.3 & 0.4 & 0.7 \\
0.3 & & 0.5 & 0.8 \\
0.4 & 0.5 & & 0.45 \\
0.7 & 0.8 & 0.45 &
\end{array} \right]$$
For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.
(a) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations. Use any type of *linkage* that you wish, but try to indicate which you have used. (See James et al. 2013, pp. 395-396.)
```{r}
d <- as.dist(matrix(c(0,   0.3, 0.4,  0.7,
                      0.3, 0,   0.5,  0.8,
                      0.4, 0.5, 0,    0.45,
                      0.7, 0.8, 0.45, 0), nrow = 4))
plot(hclust(d, method="complete"))
```
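For reference, with complete linkage observations 1 and 2 fuse first at height 0.3, observations 3 and 4 fuse at height 0.45, and the two resulting pairs fuse last at height 0.8. The fusion heights can be read directly off the fitted object:
```{r}
hclust(d, method = "complete")$height
```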
(b) Compare these results to the plot of the dendrogram in R. You can use `hclust()` to create the clusters, and the `plot()` method for this object to plot it. See `?hclust` for the linkage options. To get you started:
```{r}
plot(hclust(d, method="single"))
```
(c) Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster?
**(1, 2, 3), (4)**
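With single linkage, observations 1 and 2 fuse at height 0.3, observation 3 joins them at 0.4, and observation 4 joins last at 0.45, so a two-cluster cut separates observation 4 from the rest. A quick check with `cutree()`:
```{r}
cutree(hclust(d, method = "single"), k = 2)
```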
(d) It was mentioned in this topic that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), in which two or more of the leaves are repositioned but the meaning of the dendrogram is the same.
```{r}
plot(hclust(d, method="complete"), labels=c(2,1,4,3))
```
### Exercise 6.2
In this problem, you will perform $K$-means clustering manually, with $K = 2$, on a small example with $n = 6$ observations and $p = 2$ features. The observations are as follows.
|Obs.|X1|X2|
|--|--|--|
|1 |1 |4 |
|2 |1 |3 |
|3 |0 |4 |
|4 |5 |1 |
|5 |6 |2 |
|6 |4 |0 |
```{r}
mydata <- data.frame(X1 = c(1, 1, 0, 5, 6, 4),
                     X2 = c(4, 3, 4, 1, 2, 0))
rownames(mydata) <- paste0("obs", 1:nrow(mydata))
mydata
```
(a) Plot the observations with $X1$ on the $x$-axis and $X2$ on the $y$-axis.
```{r}
plot(mydata, xlim = c(0,6), ylim = c(0,6))
```
(b) Randomly assign a cluster label to each observation. You can use the `sample()` command in R to do this. Report the cluster labels for each observation.
```{r}
set.seed(999)
labelvalues <- c("A", "B")
colorvalues <- c(A = "red", B = "blue")
(newlabels <- sample(labelvalues, nrow(mydata), replace=TRUE))
```
(c) Compute the centroid for each cluster.
```{r}
(mydataSplit <- split(mydata, as.factor(newlabels)))
(centroidMeans <- lapply(mydataSplit, colMeans))
# do.call() rbinds the list of centroid vectors into a single matrix
(centroidMeans <- do.call(rbind, centroidMeans))
```
Now we can plot the centroid means, with red for A, and blue for B:
```{r}
# plot the centroid means
plot(mydata, xlim = c(0,6), ylim = c(0,6), col = colorvalues[newlabels])
points(centroidMeans, pch = 19, col = colorvalues)
text(centroidMeans, labelvalues, pos = 1, col = colorvalues)
```
(d) **OPTIONAL** Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.
```{r}
distanceToCentroid <- function(centroid_means, data) {
  distances <- as.matrix(dist(rbind(centroid_means, data)))
  # keep only the rows for the observations and the columns for the two centroids
  distances[3:nrow(distances), 1:2]
}
(distances <- distanceToCentroid(centroidMeans, mydata))
# reassign each observation to the label of its nearest centroid (minimum in each row)
(oldlabels <- newlabels)
(newlabels <- labelvalues[apply(distances, 1, which.min)])
# update centroidMeans
(mydataSplit <- split(mydata, as.factor(newlabels)))
(centroidMeans <- lapply(mydataSplit, colMeans))
(centroidMeans <- do.call(rbind, centroidMeans))
# update plot
plot(mydata, xlim = c(0,6), ylim = c(0,6), col = colorvalues[newlabels])
points(centroidMeans, pch = 19, col = colorvalues)
text(centroidMeans, labelvalues, pos = 1, col = colorvalues)
```
(e) **OPTIONAL** Repeat (c) and (d) until the answers obtained stop changing.
```{r}
# are we finished converging yet?
all.equal(newlabels, oldlabels)
(distances <- distanceToCentroid(centroidMeans, mydata))
(oldlabels <- newlabels)
(newlabels <- labelvalues[apply(distances, 1, which.min)])
all.equal(newlabels, oldlabels)
```
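The same update can also be wrapped in a loop that stops once the labels no longer change. This is a minimal sketch that reuses `distanceToCentroid()`, `mydata`, and `labelvalues` from above, and assumes both clusters stay non-empty:
```{r}
repeat {
  oldlabels <- newlabels
  # recompute the centroid of each current cluster
  centroidMeans <- do.call(rbind, lapply(split(mydata, as.factor(newlabels)), colMeans))
  # reassign each observation to the label of its nearest centroid
  newlabels <- labelvalues[apply(distanceToCentroid(centroidMeans, mydata), 1, which.min)]
  if (identical(newlabels, oldlabels)) break
}
newlabels
```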
(f) In your plot from (a), color the observations according to the cluster labels obtained.
**Already in the solution above.**
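For completeness, the scatterplot from (a) coloured by the final cluster labels (reusing the objects defined above):
```{r}
plot(mydata, xlim = c(0,6), ylim = c(0,6), col = colorvalues[newlabels])
```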
### Exercise 6.3
In the section, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centred to have mean zero and standard deviation one, and if we let $r_{ij}$ denote the correlation between the $i$th and $j$th observations, then the quantity $1 - r_{ij}$ is proportional to the squared Euclidean distance between the $i$th and $j$th observations.
On the `USArrests` data, part of the base `R` distribution, show that this proportionality holds.
*Hint: The Euclidean distance can be calculated using the `dist()` function, and correlations can be calculated using the `cor()` function.*
```{r}
library(ISLR)
set.seed(1)
```
```{r}
dsc <- scale(USArrests)
a <- dist(dsc)^2
b <- as.dist(1 - cor(t(dsc)))
summary(b/a)
```
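Note that `scale()` standardizes the columns, while the statement in the exercise refers to standardizing each observation (row). As an additional cross-check (a sketch, not required by the exercise), standardizing the rows instead makes the ratio essentially constant, equal to $2(p - 1) = 6$ here:
```{r}
# standardize each observation (row) to mean zero and sd one
dsr <- t(scale(t(USArrests)))
a2 <- dist(dsr)^2
b2 <- as.dist(1 - cor(t(dsr)))
summary(a2 / b2)  # approximately constant
```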
### Exercise 6.4
Consider the `USArrests` data, which is part of the base `R` distribution. We will now perform hierarchical clustering on the states.
```{r}
library(ISLR)
set.seed(2)
```
(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
```{r}
hc.complete <- hclust(dist(USArrests), method="complete")
plot(hc.complete)
```
(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
```{r}
cutree(hc.complete, 3)
table(cutree(hc.complete, 3))
```
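To list the states belonging to each of the three clusters by name, one option is to split the row names on the cluster labels:
```{r}
split(rownames(USArrests), cutree(hc.complete, 3))
```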
(c) Hierarchically cluster the states using complete linkage and Euclidean distance, *after scaling the variables to have standard deviation one*. (You can use the `scale()` command for this.)
```{r}
USArrestsStandardized <- scale(USArrests)
apply(USArrestsStandardized, 2, mean) # are these now zero?
apply(USArrestsStandardized, 2, sd) # are these now one?
```
```{r}
hc.s.complete <- hclust(dist(USArrestsStandardized), method="complete")
plot(hc.s.complete)
```
(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
```{r}
cutree(hc.s.complete, 3)
table(cutree(hc.s.complete, 3))
table(cutree(hc.s.complete, 3), cutree(hc.complete, 3))
```
**Scaling the variables changes the maximum height of the dendrogram obtained from hierarchical clustering. Eyeballing the two trees, scaling does not appear to affect their overall shape ('bushiness'), but it does change the clusters obtained from cutting the dendrogram into three clusters. For this dataset the variables probably should be standardized, because they are measured in different units: `UrbanPop` is a percentage, while the other three columns are arrest rates per 100,000 residents.**
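As a quick check on the units argument, the raw variances differ by orders of magnitude, so without scaling `Assault` dominates the Euclidean distances:
```{r}
# column variances of the unscaled data
apply(USArrests, 2, var)
```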