---
title: "ME114 Day 6: Solutions for Assignment 6"
author: "Ken Benoit and Slava Mikhaylov"
output: html_document
---
### Exercise 6.1
Suppose that we have four observations, for which we compute a dissimilarity matrix, given by
$$\left[ \begin{array}{cccc}
 & 0.3 & 0.4 & 0.7 \\
0.3 & & 0.5 & 0.8 \\
0.4 & 0.5 & & 0.45 \\
0.7 & 0.8 & 0.45 &
\end{array} \right]$$
For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.
(a) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations. Use any type of *linkage* that you wish, but try to indicate which you have used. (See James et al. 2013, pp. 395-396.)
```{r}
d <- as.dist(matrix(c(0,   0.3, 0.4,  0.7,
                      0.3, 0,   0.5,  0.8,
                      0.4, 0.5, 0,    0.45,
                      0.7, 0.8, 0.45, 0), nrow = 4))
plot(hclust(d, method="complete"))
```
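For reference, with complete linkage observations 1 and 2 fuse first at height 0.3, observations 3 and 4 fuse at height 0.45, and the two resulting pairs fuse last at height 0.8. The fusion heights can be read directly off the fitted object:
```{r}
hclust(d, method = "complete")$height
```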
(b) Compare these results to the plot of the dendrogram in R. You can use `hclust()` to create the clusters, and the `plot()` method for this object to plot it. See `?hclust` for the linkage options. To get you started:
```{r}
plot(hclust(d, method="single"))
```
(c) Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster?
**(1, 2, 3), (4)**
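With single linkage, observations 1 and 2 fuse at height 0.3, observation 3 joins them at 0.4, and observation 4 joins last at 0.45, so a two-cluster cut separates observation 4 from the rest. A quick check with `cutree()`:
```{r}
cutree(hclust(d, method = "single"), k = 2)
```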
(d) It was mentioned in this topic that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), in which two or more of the leaves are repositioned but the meaning of the dendrogram is the same.
```{r}
plot(hclust(d, method="complete"), labels=c(2,1,4,3))
```
### Exercise 6.2
In this problem, you will perform $K$-means clustering manually, with $K = 2$, on a small example with $n = 6$ observations and $p = 2$ features. The observations are as follows.
|Obs.|X1|X2|
|--|--|--|
|1 |1 |4 |
|2 |1 |3 |
|3 |0 |4 |
|4 |5 |1 |
|5 |6 |2 |
|6 |4 |0 |
```{r}
mydata <- data.frame(X1 = c(1, 1, 0, 5, 6, 4),
                     X2 = c(4, 3, 4, 1, 2, 0))
rownames(mydata) <- paste0("obs", 1:nrow(mydata))
mydata
```
(a) Plot the observations with $X1$ on the $x$-axis and $X2$ on the $y$-axis.
```{r}
plot(mydata, xlim = c(0,6), ylim = c(0,6))
```
(b) Randomly assign a cluster label to each observation. You can use the `sample()` command in R to do this. Report the cluster labels for each observation.
```{r}
set.seed(999)
labelvalues <- c("A", "B")
colorvalues <- c(A = "red", B = "blue")
(newlabels <- sample(labelvalues, nrow(mydata), replace=TRUE))
```
(c) Compute the centroid for each cluster.
```{r}
(mydataSplit <- split(mydata, as.factor(newlabels)))
(centroidMeans <- lapply(mydataSplit, colMeans))
# do.call() rbinds the list of centroid vectors into a single matrix
(centroidMeans <- do.call(rbind, centroidMeans))
```
Now we can plot the centroid means, with red for A, and blue for B:
```{r}
# plot the centroid means
plot(mydata, xlim = c(0,6), ylim = c(0,6), col = colorvalues[newlabels])
points(centroidMeans, pch = 19, col = colorvalues)
text(centroidMeans, labelvalues, pos = 1, col = colorvalues)
```
(d) **OPTIONAL** Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.
```{r}
distanceToCentroid <- function(centroid_means, data) {
  distances <- as.matrix(dist(rbind(centroid_means, data)))
  # keep only the rows for the observations and the columns for the two centroids
  distances[3:nrow(distances), 1:2]
}
(distances <- distanceToCentroid(centroidMeans, mydata))
# reassign each observation to the label of its nearest centroid (minimum in each row)
(oldlabels <- newlabels)
(newlabels <- labelvalues[apply(distances, 1, which.min)])
# update centroidMeans
(mydataSplit <- split(mydata, as.factor(newlabels)))
(centroidMeans <- lapply(mydataSplit, colMeans))
(centroidMeans <- do.call(rbind, centroidMeans))
# update plot
plot(mydata, xlim = c(0,6), ylim = c(0,6), col = colorvalues[newlabels])
points(centroidMeans, pch = 19, col = colorvalues)
text(centroidMeans, labelvalues, pos = 1, col = colorvalues)
```
(e) **OPTIONAL** Repeat (c) and (d) until the answers obtained stop changing.
```{r}
# are we finished converging yet?
all.equal(newlabels, oldlabels)
(distances <- distanceToCentroid(centroidMeans, mydata))
(oldlabels <- newlabels)
(newlabels <- labelvalues[apply(distances, 1, which.min)])
all.equal(newlabels, oldlabels)
```
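The same update can also be wrapped in a loop that stops once the labels no longer change. This is a minimal sketch that reuses `distanceToCentroid()`, `mydata`, and `labelvalues` from above, and assumes both clusters stay non-empty:
```{r}
repeat {
  oldlabels <- newlabels
  # recompute the centroid of each current cluster
  centroidMeans <- do.call(rbind, lapply(split(mydata, as.factor(newlabels)), colMeans))
  # reassign each observation to the label of its nearest centroid
  newlabels <- labelvalues[apply(distanceToCentroid(centroidMeans, mydata), 1, which.min)]
  if (identical(newlabels, oldlabels)) break
}
newlabels
```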
(f) In your plot from (a), color the observations according to the cluster labels obtained.
**Already in the solution above.**
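For completeness, the scatterplot from (a) coloured by the final cluster labels (reusing the objects defined above):
```{r}
plot(mydata, xlim = c(0,6), ylim = c(0,6), col = colorvalues[newlabels])
```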
### Exercise 6.3
In the section, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centred to have mean zero and standard deviation one, and if we let $r_{ij}$ denote the correlation between the $i$th and $j$th observations, then the quantity $1 - r_{ij}$ is proportional to the squared Euclidean distance between the $i$th and $j$th observations.
On the `USArrests` data, part of the base `R` distribution, show that this proportionality holds.
*Hint: The Euclidean distance can be calculated using the `dist()` function, and correlations can be calculated using the `cor()` function.*
```{r}
library(ISLR)
set.seed(1)
```
```{r}
dsc <- scale(USArrests)
a <- dist(dsc)^2
b <- as.dist(1 - cor(t(dsc)))
summary(b/a)
```
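Note that `scale()` standardizes the columns, while the statement in the exercise refers to standardizing each observation (row). As an additional cross-check (a sketch, not required by the exercise), standardizing the rows instead makes the ratio essentially constant, equal to $2(p - 1) = 6$ here:
```{r}
# standardize each observation (row) to mean zero and sd one
dsr <- t(scale(t(USArrests)))
a2 <- dist(dsr)^2
b2 <- as.dist(1 - cor(t(dsr)))
summary(a2 / b2)  # approximately constant
```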
### Exercise 6.4
Consider the `USArrests` data, which is part of the base `R` distribution. We will now perform hierarchical clustering on the states.
```{r}
library(ISLR)
set.seed(2)
```
(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
```{r}
hc.complete <- hclust(dist(USArrests), method="complete")
plot(hc.complete)
```
(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
```{r}
cutree(hc.complete, 3)
table(cutree(hc.complete, 3))
```
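To list the states belonging to each of the three clusters by name, one option is to split the row names on the cluster labels:
```{r}
split(rownames(USArrests), cutree(hc.complete, 3))
```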
(c) Hierarchically cluster the states using complete linkage and Euclidean distance, *after scaling the variables to have standard deviation one*. (You can use the `scale()` command for this.)
```{r}
USArrestsStandardized <- scale(USArrests)
apply(USArrestsStandardized, 2, mean) # are these now zero?
apply(USArrestsStandardized, 2, sd) # are these now one?
```
```{r}
hc.s.complete <- hclust(dist(USArrestsStandardized), method="complete")
plot(hc.s.complete)
```
(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
```{r}
cutree(hc.s.complete, 3)
table(cutree(hc.s.complete, 3))
table(cutree(hc.s.complete, 3), cutree(hc.complete, 3))
```
**Scaling the variables changes the maximum height of the dendrogram obtained from hierarchical clustering. Eyeballing the two trees, scaling does not appear to affect their overall shape ('bushiness'), but it does change the clusters obtained from cutting the dendrogram into three clusters. For this dataset the variables probably should be standardized, because they are measured in different units: `UrbanPop` is a percentage, while the other three columns are arrest rates per 100,000 residents.**
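As a quick check on the units argument, the raw variances differ by orders of magnitude, so without scaling `Assault` dominates the Euclidean distances:
```{r}
# column variances of the unscaled data
apply(USArrests, 2, var)
```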