---
output: html_document
editor_options:
chunk_output_type: console
---
# Unsupervised Learning {#unsupervised}
This chapter deals with machine learning problems which are unsupervised.
This means the machine has access to a set of inputs, $x$, but the desired outcome, $y$, is not available.
Clearly, learning a relation between inputs and outcomes is impossible, but there are still a lot of problems of interest.
In particular, we may want to find a compact representation of the inputs, be it for visualization or further processing.
This is the problem of _dimensionality reduction_.
For the same reasons we may want to group similar inputs. This is the problem of _clustering_.
In the statistical terminology, with some exceptions, this chapter can be thought of as multivariate __exploratory__ statistics.
For multivariate __inference__, see Chapter \@ref(multivariate).
## Dimensionality Reduction {#dim-reduce}
```{example, label='bmi'}
Consider the heights and weights of a sample of individuals.
The data may seemingly reside in $2$ dimensions, but given the height, we have a pretty good guess of a person's weight, and vice versa.
We can thus state that heights and weights are not really two dimensional, but roughly lie on a $1$ dimensional subspace of $\mathbb{R}^2$.
```
```{example, label='iq'}
Consider the correctness of the answers to a questionnaire with $p$ questions.
The data may seemingly reside in a $p$ dimensional space, but if there is such a thing as "skill", then given the correctness of a person's replies to a subset of questions, we have a good idea of how they score on the rest.
If skill is indeed a one dimensional quality, then the questionnaire data should organize around a single line in the $p$ dimensional cube.
```
```{example, label='blind-signal'}
Consider $n$ microphones recording an individual.
The digitized recording consists of $p$ samples.
Are the recordings really a shapeless cloud of $n$ points in $\mathbb{R}^p$?
Since they all record the same sound, one would expect the $n$ $p$-dimensional points to arrange around the original, noiseless, sound: a single point in $\mathbb{R}^p$.
If the microphones are at different distances from the source, their volumes and echoes will differ.
We would thus expect the $n$ points to arrange about a __line__ in $\mathbb{R}^p$.
```
### Principal Component Analysis {#pca}
_Principal Component Analysis_ (PCA) is such a basic technique, it has been rediscovered and renamed independently in many fields.
It can be found under the names of
Discrete Karhunen–Loève Transform; Hotelling Transform; Proper Orthogonal Decomposition; Eckart–Young Theorem; Schmidt–Mirsky Theorem; Empirical Orthogonal Functions; Empirical Eigenfunction Decomposition; Empirical Component Analysis; Quasi-Harmonic Modes; Spectral Decomposition; Empirical Modal Analysis, and possibly more^[http://en.wikipedia.org/wiki/Principal_component_analysis].
The many names are quite interesting as they offer an insight into the different problems that led to PCA's (re)discovery.
Return to the BMI problem in Example \@ref(exm:bmi).
Assume you wish to give each individual a "size score".
Also assume this score is a __linear__ combination of height and weight.
That is the problem solved by PCA:
It returns the linear combination that has the largest variability, i.e., the combination which best distinguishes between individuals.
The variance maximizing motivation above was the one that guided @hotelling1933analysis.
But $30$ years before him, @pearson1901liii derived the same procedure with a different motivation in mind.
Pearson was also trying to give each individual a score.
He did not care about variance maximization, however.
He simply wanted a small set of coordinates in some (linear) space that approximates the original data well.
Before we proceed, we give an example to fix ideas.
Consider the crime rate data in `USArrests`, which encodes reported murder events, assaults, rapes, and the urban population of each American state.
```{r}
head(USArrests)
```
Following Hotelling's motivation, we may want to give each state a "criminality score".
We first remove the `UrbanPop` variable, which does not encode crime levels.
We then z-score each variable with `base::scale()`, and call PCA for a sequence of $1,\dots,3$ criminality scores that best separate between states.
```{r, echo=FALSE}
library(magrittr)
```
```{r, cache=TRUE}
USArrests.1 <- USArrests[,-3] %>% scale
pca.1 <- prcomp(USArrests.1, scale = TRUE)
pca.1
```
Things to note and terminology:
- The score that best distinguishes between states should be indifferent to the __average__ of each variable.
We also don't want the score to be sensitive to the measurement scale.
Formally, we want the scores to be _affine invariant_.
We thus perform PCA in the z-score scale of each variable, obtained with the `scale` function.
- PCA is performed with the `stats::prcomp` function.
It returns the contribution (weight) of each original variable to the new criminality score. After rescaling, these weights are called the _loadings_.
Borrowing from the __Factor Analysis__ literature, the loadings may be called _rotations_, which is their name in the `stats::prcomp()` output. If you are confused between weights, loadings and rotations, see this [Cross Validated](https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another) entry.
- The number of possible scores is the same as the number of original variables in the data.
- The new scores are called the _principal components_, labeled `PC1`,...,`PC3` in our output. They are computed by summing the original variables weighted by their loadings.
- The loadings/rotation on PC1 tell us that the best separation between states is along the average crime rate.
Why is this?
Because all the $3$ crime variables have a similar loading on PC1.
- The other PCs are slightly harder to interpret, but it is an interesting exercise.
__If we now represent each state, not with its original $3$ crimes measurements variables, but only with the first $2$ PCs (for example), we have reduced the dimensionality of the data.__
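To see concretely how the scores relate to the loadings, here is a minimal sketch (assuming the `pca.1` object fitted above): each PC score is just the centered and scaled data multiplied by the corresponding loadings vector.
```{r}
# PC scores = (centered, scaled) data %*% rotation matrix.
# USArrests.1 was already z-scored, so the centering/scaling below is essentially a no-op.
scores.manual <- scale(USArrests.1, center = pca.1$center, scale = pca.1$scale) %*% pca.1$rotation
head(scores.manual)
head(pca.1$x)  # prcomp's own scores; should agree with the manual computation
```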
#### Mathematics of PCA
What is the mathematical problem that is actually solved with PCA?
Finding a linear combination ($v$) of the original variables ($x$), so that the new score/index ($v'x$) best separates individuals.
Best separation implies that the variance of $v'x$ is maximal.
Clearly, $Var[v'x]$ may explode if any $v$ is allowed, so we need to pick $v$ from some "fair" set.
It is most convenient, mathematically, to constrain the $l_2$ norm to some constant: $\Vert v \Vert^2_2=\sum v_j^2=1$.
The first "best separating score", known as the first _principal component_ (PC), is thus
$$v_1'x \quad s.t. \quad v_1=argmax_{v}\{Var[v'x], \text{ and } \Vert v \Vert_2=1 \} .$$
The second PC, is the same, only that it is required to be orthogonal to the first PC:
$$v_2'x \quad s.t. \quad v_2=argmax_{v}\{Var[v'x], \text{ and } \Vert v \Vert_2=1, \text{ and } v'v_1=0 \} .$$
The construction of the next PCs follows the same lines: find a linear transformation of the data that best separates observations and is orthogonal to the previous PCs.
#### How Hard is the PCA Problem?
Estimating all the PCs in the data is well defined algebraically if $n>p$, in which case $p$ PCs are computable.
This is the algebraic part of the problem, which is rather easy, and solved with [SVD](https://en.wikipedia.org/wiki/Singular-value_decomposition).
If we view PCA as an inference tool, we may ask about its statistical performance.
It turns out that PCA has the same statistical difficulty as estimating a covariance matrix.
As we already saw in the Multivariate Statistics Chapter (\@ref(multivariate)), estimating covariances is a hard task. We thus recommend: don't trust your PCs if $n$ is not much larger than $p$, and see the bibliographic notes for further details.
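As a sketch of the algebraic part only, the PCs can be read off the SVD of the centered and scaled data matrix: the right singular vectors are the loadings, and $UD$ gives the scores, each column defined only up to its sign.
```{r}
# PCA via SVD: V holds the loadings, U %*% diag(d) the scores (up to column signs).
sv <- svd(scale(USArrests.1))  # the data was already z-scored, so scale() changes nothing here
sv$v                           # compare with pca.1$rotation
head(sv$u %*% diag(sv$d))      # compare with pca.1$x
```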
### Dimensionality Reduction Preliminaries
Before presenting methods other than PCA, we need some terminology.
- __Variable__:
A.k.a. _dimension_, or _feature_, or _column_. One of the $p$ measurements, in its raw scale.
- __Feature Mapping__:
A.k.a. _variable transformation_, or _data augmentation_. A measurement in a new, transformed, scale.
- __Data__:
A.k.a. _sample_, _observations_.
Will typically consist of $n$ vectors of dimension $p$.
We typically denote the data as an $n\times p$ matrix $X$.
- __Data space__:
A.k.a. _feature space_. The space of all possible values of $X$. We denote it by $\mathcal{X}$.
- __Network__:
A representation of the similarities (or dissimilarities) between the $n$ units in the data. We denote it by $\mathcal{G}$; it may be encoded in an $n \times n$ matrix.
- __Manifold__:
A generalization of a linear space, which is regular enough so that, __locally__, it has all the properties of a linear space.
We will denote an arbitrary manifold by $\mathcal{M}$, and by $\mathcal{M}_q$ a $q$ dimensional^[You are probably used to thinking of the __dimension__ of linear spaces. We will not rigorously define what the dimension of a manifold is, but you may think of it as the number of free coordinates needed to navigate along it.] manifold.
- __Embedding__:
Informally speaking: a "shape preserving" mapping (see figure below). We denote an embedding of the data $X$, into a manifold $\mathcal{M}$ by $X\mapsto \mathcal{M}$.
- __Embedding Function__:
If the embedding is not only an algorithm, but rather, has a functional form representation, we call it an _embedding function_ $f$. Given such a function, we are not restricted to embeddings of the original data, $X$, but may also embed new data points from $\mathcal{X}$: $f:\mathcal{X}\mapsto\mathcal{M}$.
- __Generative Model__:
Known to statisticians as the __sampling distribution__.
The assumed stochastic process that generated the observed $X$.
![Various embedding algorithms. No embedding of the sphere into the plane is perfect. This is obviously not new: map makers have known it for centuries!](art/sphx_glr_plot_manifold_sphere_001.png)
There are many motivations for dimensionality reduction:
1. __Scoring__:
Give each observation an interpretable, simple score (Hotelling's motivation).
1. __Latent structure__:
Recover unobservable information from indirect measurements.
E.g: Blind signal reconstruction, CT scan, cryo-electron microscopy, etc.
1. __Signal to Noise__:
Denoise measurements before further processing like clustering, supervised learning, etc.
1. __Compression__:
Save on RAM, CPU, and communication when operating on a lower dimensional representation of the data.
### Latent Variable Generative Approaches
All generative approaches to dimensionality reduction will include a set of latent/unobservable variables, which we can try to recover from the observables $X$.
The unobservable variables will typically have a lower dimension than the observables, thus, dimension is reduced.
We start with the simplest case of linear Factor Analysis.
#### Factor Analysis (FA) {#factor-analysis}
FA originates from the psychometric literature.
We thus revisit the IQ (actually g-factor^[https://en.wikipedia.org/wiki/G_factor_(psychometrics)]) Example \@ref(exm:iq):
```{example}
Assume $n$ respondents answer $p$ quantitative questions: $x_i \in \mathbb{R}^p, i=1,\dots,n$.
Also assume their responses are some linear function of a single personality attribute, $s_i$.
We can think of $s_i$ as the subject's "intelligence".
We thus have
\begin{align}
x_i = s_i A + \varepsilon_i
\end{align}
And in matrix notation:
\begin{align}
X = S A+\varepsilon,
(\#eq:factor)
\end{align}
where $A$ is the $q \times p$ matrix of factor loadings, and $S$ the $n \times q$ matrix of latent personality traits.
In our particular example where $q=1$, the problem is to recover the unobservable intelligence scores, $s_1,\dots,s_n$, from the observed answers $X$.
```
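To fix ideas, here is a minimal simulation sketch of the generative model in Eq. \@ref(eq:factor) with a single latent trait ($q=1$); the sample size, loadings, and noise level below are arbitrary choices for illustration only.
```{r}
set.seed(1)
n <- 100; p <- 6; q <- 1
S <- matrix(rnorm(n * q), n, q)              # latent traits ("intelligence")
A <- matrix(runif(q * p, 0.5, 1.5), q, p)    # factor loadings
eps <- matrix(rnorm(n * p, sd = 0.3), n, p)  # measurement noise
X <- S %*% A + eps                           # observed answers
round(cor(X), 2)                             # the single factor induces strong correlations
```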
We may try to estimate $S A$ by assuming some distribution on $S$ and $\varepsilon$ and apply maximum likelihood.
Under standard assumptions on the distribution of $S$ and $\varepsilon$, recovering $S$ from $\widehat{S A }$ is still impossible as there are infinitely many such solutions.
In the statistical parlance we say the problem is _non identifiable_, and in the applied mathematics parlance we say the problem is _ill posed_.
```{remark}
The non-uniqueness (non-identifiability) of the FA solution under variable rotation is never mentioned in the PCA context.
Why is this?
This is because PCA and FA solve different problems.
$\hat S$ in PCA is well defined because PCA does not seek a single $S$ but rather a __sequence__ of $S_q$ with dimensions growing from $q=1$ to $q=p$.
```
The FA terminology is slightly different from that of PCA:
- __Factors__:
The unobserved attributes $S$.
Akin to the _principal components_ in PCA.
- __Loading__:
The $A$ matrix; the contribution of each factor to the observed $X$.
- __Rotation__:
An arbitrary orthogonal re-combination of the factors, $S$, and loadings, $A$, which changes the interpretation of the result.
The FA literature offers several heuristics to "fix" the identifiability problem of FA.
These are known as _rotations_, and go under the names of _Varimax_, _Quartimax_, _Equimax_, _Oblimin_, _Promax_, and possibly others.
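For instance, `stats::varimax()` applies the Varimax criterion to any loadings matrix; a minimal sketch on an arbitrary (made-up) $3 \times 2$ loadings matrix:
```{r varimax-sketch}
# An arbitrary loadings matrix, for illustration only.
A0 <- matrix(c(0.7, 0.6, 0.5,
               0.5, -0.4, 0.6), ncol = 2)
stats::varimax(A0)$loadings  # the Varimax-rotated loadings
```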
Because of their great similarity, FA is often confused with PCA.
For a discussion of the similarities and dissimilarities, see this excellent [StackExchange Q](https://stats.stackexchange.com/questions/123063/is-there-any-good-reason-to-use-pca-instead-of-efa-also-can-pca-be-a-substitut).
#### Independent Component Analysis (ICA)
Like FA, _independent component analysis_ (ICA) is a family of latent space models, thus, a _meta-method_.
It assumes the observables are some function of the latent variables $S$.
In many cases this function is assumed to be linear in $S$ so that ICA is compared, if not confused, with PCA and even more so with FA.
The fundamental idea of ICA is that $S$ has a joint distribution of __non-Gaussian__, __independent__ variables.
This independence assumption solves the non-uniqueness of $S$ in FA.
As such, it can be thought of as a type of rotation in FA.
Then again, if the assumed distribution of $S$ is both non-Gaussian and well justified, then ICA is well defined, and more than just an arbitrary rotation of FA.
Being a generative model, estimation of $S$ can then be done using maximum likelihood, or other estimation principles.
ICA is a popular technique in signal processing, where $A$ is actually the signal, such as sound in Example \@ref(exm:blind-signal).
Recovering $A$ thus amounts to recovering the original signals mixed into the recorded $X$.
### Purely Algorithmic Approaches
We now discuss dimensionality reduction approaches that are not stated via their generative model, but rather, directly as an algorithm.
This does not mean that they cannot be cast via their generative model, but rather they were not motivated as such.
#### Multidimensional Scaling (MDS)
MDS can be thought of as a variation on PCA, that begins with the $n \times n$ graph of distances between data points $\mathcal{G}$; in contrast to PCA which operates on the original $n \times p$ data $X$.
The term _graph_ is typically used in this context, but saying _network_ instead of _graph_ is more accurate. This is because a graph encodes connections (topology) and networks encode distances (geometry). Put differently, a graph can be encoded in a matrix of zeroes and ones, and a network in a matrix of real numbers.
MDS aims at embedding $\mathcal{G}$ into the plane, typically for visualization, while preserving the original distances.
Basic results in graph/network theory suggest that the geometry of a graph cannot be preserved when embedding it into lower dimensions [@graham1988isometric].
The different types of MDS, such as _Classical MDS_ and _Sammon Mapping_, differ in the _stress function_ that penalizes the geometric distortion caused by the embedding.
#### Local Multidimensional Scaling (local-MDS) {#loc-mds}
```{example, label='non-euclidean'}
Consider data, $X$, of Cartesian coordinates on the globe.
At short distances, constructing a dissimilarity graph, $X \mapsto \mathcal{G}$ using Euclidean distances between Cartesian coordinates will capture the true distance between points.
At long distances, however, Euclidean distances are a very poor approximation of the distance one has to travel between points on the globe.
A more extreme example is coordinates on the brain's cerebral cortex.
Being a highly folded surface, the Euclidean distance between points is far from the true geodesic distance along the cortex's surface^[Then again, it is possible that the true distances are along the white-matter fibers connecting regions beneath the cortex, in which case Euclidean distances are more appropriate than geodesic distances. We put that aside for now.].
```
local-MDS is aimed at solving the case where Euclidean distances, implied by PCA and FA, are a bad measure of distance.
Instead of using the graph of Euclidean distances between all pairs of points, local-MDS computes $\mathcal{G}$ starting with the Euclidean distances between pairs of nearest points only.
Longer distances are solved as a [shortest path problem](https://en.wikipedia.org/wiki/Shortest_path_problem).
For instance, using the [Floyd–Warshall algorithm](https://en.wikipedia.org/wiki/Floyd–Warshall_algorithm), which sums distances between closest points.
This is akin to computing the distance from Jerusalem to Beijing by summing Euclidean distances between Jerusalem-Baghdad, Baghdad-Tehran, Tehran-Ashgabat, Ashgabat-Tashkent, and so on. Because the [geographical distance](https://en.wikipedia.org/wiki/Geographical_distance) between nearby cities is well approximated by the Euclidean distance, summing local distances is better than operating directly on the Euclidean distance between Jerusalem and Beijing.
After computing $\mathcal{G}$, local-MDS ends with the usual MDS for the embedding.
Because local-MDS ends with a regular MDS, it can be seen as a non-linear embedding into a linear manifold $\mathcal{M}$.
#### Isometric Feature Mapping (IsoMap) {#isomap}
Like local-MDS, only that the embedding, and not only the computation of the distances, is local.
#### Local Linear Embedding (LLE)
Very similar to IsoMap (\@ref(isomap)).
#### t-SNE
t-SNE is a recently popularized visualization method for high-dimensional data.
t-SNE starts by computing a proximity graph, $\mathcal{G}$.
Computation of distances in the graph assumes a Gaussian decay of distances.
Put differently: only the nearest observations have a non-vanishing similarity.
This stage is similar (in spirit) to the growing of $\mathcal{G}$ in local-MDS (\@ref(loc-mds)).
The second stage in t-SNE consists of finding a mapping to 2D (or 3D), which conserves distances in $\mathcal{G}$.
The uniqueness of t-SNE compared to other embeddings is in the way distances are computed in the target 2D (or 3D) space.
#### UMAP
See [Understanding UMAP](https://pair-code.github.io/understanding-umap/).
#### Force Directed Graph Drawing
This class of algorithms starts with a proximity graph $\mathcal{G}$, and defines a set of physically motivated "forces" operating between data points.
Think of $\mathcal{G}$ as governing a set of springs between data points.
These springs have some steady-state.
The location of points in the embedding corresponds to the steady state of this system of springs.
#### Kernel PCA (kPCA)
Returning to the BMI example (\@ref(exm:bmi)): what if we want to learn scores that best separate between individuals but, unlike PCA, are non-linear in the original features?
Kernel PCA does just that, only that it restricts the possible scores to simple functions of the original variables.
These functions are known as the _feature mapping_.
The feature mappings reside in a function space called a Reproducing Kernel Hilbert Space (RKHS), thus giving kPCA its name.
By defining a _Kernel_ one defines the class of feature mappings implied by the algorithm.
The magic of kPCA, like other kernel methods, is that even if one chooses a kernel that maps the $p$ features to an infinite dimensional space, the kPCA problem still has a closed-form solution.
This implies that theoreticians may study the statistical properties of kPCA, and that solutions do not require solving optimization problems in intractable function spaces.
#### Sparse PCA (sPCA)
sPCA is a type of PCA where the loadings are sparse.
This means that PCs are linear combinations of a small number of variables.
This makes sPCA easier to interpret.
Note that the _varimax_ rotation in factor-analysis (\@ref(factor-analysis)) has a similar goal: create factors with the smallest number of contributing variables, so that they are easy to explain.
#### Sparse kernel PCA (skPCA)
A marriage between sPCA and kPCA: generate scores that are each a non-linear transformation of a small number of variables.
#### Correspondence Analysis (CA)
What if $x$ is not continuous, i.e., $\mathcal{X}\neq \mathbb{R}^p$?
We could dummy-code $x$, and then use plain PCA.
A more principled view, when $x$ is categorical, is known as _Correspondence Analysis_.
### Dimensionality Reduction in R
#### PCA {#pca-in-r}
We already saw the basics of PCA in \@ref(pca).
The fitting is done with the `stats::prcomp` function.
The _bi-plot_ is a useful way to visualize the output of PCA.
```{r}
# library(devtools)
# install_github("vqv/ggbiplot")
ggbiplot::ggbiplot(pca.1)
```
Things to note:
- We used the `ggbiplot::ggbiplot` function (available from github, but not from CRAN), because it has a nicer output than `stats::biplot`.
- The data is presented in the plane of PC1 and PC2.
- The bi-plot plots the loadings as arrows. The coordinates of the arrows correspond to the weights of the original variables in each PC.
For example, the x-value of each arrow is its loading on PC1.
Since the weights of Murder, Assault, and Rape are almost the same, we conclude that PC1 captures the average crime rate in each state.
The _scree plot_ depicts the quality of the approximation of $X$ as $q$ grows, i.e., as we increase the dimension of our new score.
This is depicted using the proportion of variability in $X$ that is captured by each added PC.
It is customary to choose $q$ as the first PC that has a relatively low contribution to the approximation of $X$.
This is known as the "knee heuristic".
```{r scree}
ggbiplot::ggscreeplot(pca.1)
```
The scree plot suggests that PC1 alone captures about 0.8 of the variability in crime levels.
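The same proportions can be read directly off the `prcomp` output; a minimal sketch:
```{r}
summary(pca.1)                    # standard deviations and proportion of variance per PC
pca.1$sdev^2 / sum(pca.1$sdev^2)  # the scree plot values, computed by hand
```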
The next plot is the classical classroom introduction to PCA.
It shows that states are indeed arranged along a single line in the "Assault-Murder" plane. This line is PC1.
```{r, echo=FALSE}
USArrests.1 <- USArrests[,-3] %>% scale
load <- pca.1$rotation
slope <- load[2, ]/load[1, ]
mn <- apply(USArrests.1, 2, mean)
intcpt <- mn[2] - (slope * mn[1])
# scatter plot with the two new axes added
USArrests.2 <- USArrests[,1:2] %>% scale
xlim <- range(USArrests.2) # overall min, max
plot(USArrests.2, xlim = xlim, ylim = xlim, pch = 16, col = "purple") # both axes same length
abline(intcpt[1], slope[1], lwd = 2) # first component solid line
abline(intcpt[2], slope[2], lwd = 2, lty = 2) # second component dashed
legend("right", legend = c("PC 1", "PC 2"), lty = c(1, 2), lwd = 2, cex = 1)
# projections of points onto PCA 1
y1 <- intcpt[1] + slope[1] * USArrests.2[, 1]
x1 <- (USArrests.1[, 2] - intcpt[1])/slope[1]
y2 <- (y1 + USArrests.1[, 2])/2
x2 <- (x1 + USArrests.1[, 1])/2
segments(USArrests.1[, 1], USArrests.1[, 2], x2, y2, lwd = 2, col = "purple")
```
More implementations of PCA:
```{r many PCA implementations, eval=FALSE}
# FAST solutions:
gmodels::fast.prcomp()
# More detail in output:
FactoMineR::PCA()
# For flexibility in algorithms and visualization:
ade4::dudi.pca()
# Another one...
amap::acp()
```
#### FA
```{r FA, cache=TRUE}
fa.1 <- psych::principal(USArrests.1, nfactors = 2, rotate = "none")
fa.1
biplot(fa.1, labels = rownames(USArrests.1))
```
Numeric comparison with PCA:
```{r}
fa.1$loadings
pca.1$rotation
```
Things to note:
- We perform FA with the `psych::principal` function. The `Principal Component Analysis` title is due to the fact that FA without rotations is equivalent to PCA.
- The first factor (`fa.1$loadings`) has different weights than the first PC (`pca.1$rotation`) because they have different normalizations. They have the same interpretation, however: all weights of the first component are similar, suggesting it merely captures the average crime rate.
Graphical model fans will like the following plot, where the contribution of each variable to each factor is encoded in the width of the arrow.
```{r, eval=TRUE}
qgraph::qgraph(fa.1$loadings)
```
Let's add a rotation (Varimax), and note that the rotation has indeed changed the loadings of the variables, thus the interpretation of the factors.
```{r varimax, cache=TRUE}
fa.2 <- psych::principal(USArrests.1, nfactors = 2, rotate = "varimax")
fa.2$loadings
```
Things to note:
- FA with a rotation is no longer equivalent to PCA.
- The rotated factors are now called _rotated components_, and are reported as `RC1` and `RC2`.
#### ICA
```{r ICA, cache=TRUE}
ica.1 <- fastICA::fastICA(USArrests.1, n.comp = 2) # Also performs projection pursuit
plot(ica.1$S)
abline(h=0, v=0, lty=2)
text(ica.1$S, pos = 4, labels = rownames(USArrests.1))
# Compare with PCA (first two PCs):
arrows(x0 = ica.1$S[,1], y0 = ica.1$S[,2],
x1 = -pca.1$x[,2], y1 = pca.1$x[,1],
col='red', pch=19, cex=0.5)
```
Things to note:
- We fit ICA with `fastICA::fastICA`.
- The ICA components, like any other rotated components, are different than the PCA components.
- The fastICA algorithm has a stochastic component, so the solution will differ between re-runs (making comparison to PCA harder).
#### MDS
Classical MDS compared to PCA.
```{r MDS, results='hold', cache=TRUE}
# We first need a dissimilarity matrix/graph:
state.disimilarity <- dist(USArrests.1)
mds.1 <- cmdscale(state.disimilarity)
plot(mds.1, pch = 19)
abline(h=0, v=0, lty=2)
USArrests.2 <- USArrests[,1:2] %>% scale
text(mds.1, pos = 4, labels = rownames(USArrests.2), col = 'tomato')
# Compare with PCA (first two PCs):
points(pca.1$x[,1:2], col='red', pch=19, cex=0.5)
```
Things to note:
- We first compute a dissimilarity graph with `stats::dist()`. See `cluster::daisy` for a wider variety of dissimilarity measures.
- We learn the MDS embedding with `stats::cmdscale`.
- Embedding with the first two components of PCA is exactly the same as classical MDS with Euclidean distances, up to a sign flip of each component (a quick numerical check follows below).
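Here is that check; a minimal sketch comparing absolute values of the two embeddings (absolute values, because each column is only defined up to a sign flip). It should return `TRUE`, or report only a tiny numeric difference.
```{r}
# Classical MDS on Euclidean distances reproduces the first two PC scores, up to column signs.
all.equal(abs(unname(mds.1)), abs(unname(pca.1$x[, 1:2])), tolerance = 1e-6)
```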
Let's try other stress functions for MDS, like Sammon's stress, and compare them with PCA.
```{r SammonMDS, results='hold'}
mds.2 <- MASS::sammon(state.disimilarity, trace = FALSE)
plot(mds.2$points, pch = 19)
abline(h=0, v=0, lty=2)
text(mds.2$points, pos = 4, labels = rownames(USArrests.2))
# Compare with PCA (first two PCs):
arrows(
x0 = mds.2$points[,1], y0 = mds.2$points[,2],
x1 = pca.1$x[,1], y1 = pca.1$x[,2],
col='red', pch=19, cex=0.5)
```
Things to note:
- `MASS::sammon` does the embedding.
- The Sammon embedding is different from the PCA embedding.
#### t-SNE
For a native R implementation: [tsne package](https://cran.r-project.org/web/packages/tsne/).
For an R wrapper for C libraries: [Rtsne package](https://github.com/jkrijthe/Rtsne).
The [MNIST](http://yann.lecun.com/exdb/mnist/) image bank of hand-written images has its own data format. The import process is adapted from [David Dalpiaz](https://gist.github.com/daviddalpiaz/ae62ae5ccd0bada4b9acd6dbc9008706):
```{r}
show_digit <- function(arr784, col = gray(12:1 / 12), ...) {
image(matrix(as.matrix(arr784[-785]), nrow = 28)[, 28:1], col = col, ...)
}
# load image files
load_image_file <- function(filename) {
ret <- list()
f <- file(filename, 'rb')
readBin(f, 'integer', n = 1, size = 4, endian = 'big')
n <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')
nrow <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')
ncol <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')
x <- readBin(f, 'integer', n = n * nrow * ncol, size = 1, signed = FALSE)
close(f)
data.frame(matrix(x, ncol = nrow * ncol, byrow = TRUE))
}
# load label files
load_label_file <- function(filename) {
f <- file(filename, 'rb')
readBin(f, 'integer', n = 1, size = 4, endian = 'big')
n <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')
y <- readBin(f, 'integer', n = n, size = 1, signed = FALSE)
close(f)
y
}
# load images
train <- load_image_file("data/train-images-idx3-ubyte")
test <- load_image_file("data/t10k-images-idx3-ubyte")
# load labels
train$y = as.factor(load_label_file("data/train-labels-idx1-ubyte"))
test$y = as.factor(load_label_file("data/t10k-labels-idx1-ubyte"))
```
Inspect some digits:
```{r}
par(mfrow=c(3,3))
ind <- sample(1:nrow(train),9)
for(i in 1:9){
show_digit(train[ind[i],], main=paste('Label= ',train$y[ind[i]], sep='')) }
```
The analysis is adapted from [Shruti Marwaha](https://rpubs.com/marwahsi/tnse).
```{r}
numTrain <- 5e3 # Subset data for speed
rows <- sample(1:nrow(train), numTrain)
train.sub <- train[rows, -which(names(train)=='y')] %>% as.matrix
train.sub.labs <- train[rows, which(names(train)=='y')]
tsne <- Rtsne::Rtsne(train.sub, dims = 2, perplexity=30, verbose=FALSE, max_iter = 500)
colors <- rainbow(length(unique(train.sub.labs)))
names(colors) <- unique(train.sub.labs)
par(mgp=c(2.5,1,0))
par(mfrow=c(1,1))
plot(tsne$Y, t='n',
main="tSNE",
xlab="tSNE dimension 1",
ylab="tSNE dimension 2",
"cex.main"=2,
"cex.lab"=1.5)
text(tsne$Y, labels=train.sub.labs, col=colors[train.sub.labs])
```
#### Force Embedding
I am unaware of an R implementation of force-embedding.
Maybe because the interactive nature of the algorithm is not well suited to R.
Force embedding is much more natural in interactive GUIs.
Here is a link for a fun [javascript implementation](http://bl.ocks.org/eesur/be2abfb3155a38be4de4).
#### Sparse PCA
```{r sPCA}
# Compute a robust covariance matrix, used as the similarity graph
state.similarity <- MASS::cov.rob(USArrests.1)$cov
spca1 <- elasticnet::spca(state.similarity, K=2, type="Gram", sparse="penalty", trace=FALSE, para=c(0.06,0.16))
spca1$loadings
```
Things to note:
- I used the `elasticnet::spca` function.
- Are the solutions sparse? Yes! PC2 depends on a single variable only: `Murder`.
#### Kernel PCA
```{r kPCA, eval=TRUE}
library(kernlab)
kpc <- kpca(~.,data=as.data.frame(USArrests.1), kernel="rbfdot", kpar=list(sigma=0.2), features=2)
plot(rotated(kpc),
xlab="1st Principal Component",
ylab="2nd Principal Component")
abline(h=0, v=0, lty=2)
text(rotated(kpc), pos = 4, labels = rownames(USArrests.2))
```
Things to note:
- We used `kernlab::kpca` for kPCA.
- `rotated` projects the data on its principal components (the above "scores").
- See `?'kpca-class'` or `?rotated` for help on available utility functions.
- `kernel=` governs the class of feature mappings.
- `kpar=list(sigma=0.2)` provides parameters specific to each type of kernel. See `?kpca`.
- `features=2` is the number of principal components (scores) to learn.
- You may notice the "Horseshoe" pattern of the kPCA embedding, a.k.a. "Croissants", or the _Guttman Effect_. The horseshoe is a known phenomenon in low dimensional visualizations. See [J. De Leeuw's online paper](https://rpubs.com/deleeuw/133786) for more details.
#### Multiple Correspondence Analysis (MCA)
See @izenman2008modern.
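For a quick impression, here is a minimal sketch, assuming the `FactoMineR` package is installed: discretize the (scaled) crime variables into two categories each, and run MCA on the resulting factors. The discretization below is a hypothetical choice made only for illustration.
```{r MCA-sketch, eval=FALSE}
# Cut each crime variable into Low/High and run MCA on the resulting categorical data.
USArrests.cat <- as.data.frame(lapply(as.data.frame(USArrests.1),
                                      function(x) cut(x, breaks = 2, labels = c("Low", "High"))))
mca.1 <- FactoMineR::MCA(USArrests.cat, graph = FALSE)
head(mca.1$ind$coord)  # coordinates of the states on the MCA dimensions
```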
\pagebreak
## Clustering {#cluster}
```{example, label="photos"}
Consider the tagging of your friends' pictures on Facebook.
If you tagged some pictures, Facebook may try to use a supervised approach to automatically label photos.
If you never tagged pictures, a supervised approach is impossible.
It is still possible, however, to group similar pictures together.
```
```{example, label="spam"}
Consider the problem of spam detection.
It would be nice if each user could label several thousands emails, to apply a supervised learning approach to spam detection.
This is an unrealistic demand, so a _pre-clustering_ stage is useful: the user only needs to label a couple of dozen (hopefully homogeneous) clusters, before solving the supervised learning problem.
```
In clustering problems, we seek to group observations that are similar.
There are many motivations for clustering:
1. __Understanding__:
The most common use of clustering is probably as an exploratory step, to identify homogeneous groups in the data^[As such, you may wonder why clustering is part of machine learning at all? Maybe call it _machine-assisted human-learning_?].
1. __Dimensionality reduction__:
Clustering may be seen as a method for dimensionality reduction.
Unlike the approaches in the Dimensionality Reduction Section \@ref(dim-reduce), it does not compress __variables__ but rather __observations__.
Each group of homogeneous observations may then be represented as a single _prototypical observation_ of the group.
1. __Pre-Labelling__:
Clustering may be performed as a pre-processing step for supervised learning, when labeling all the samples is impossible due to "budget" constraints, like in Example \@ref(exm:spam). This is sometimes known as _pre-clustering_.
Clustering, like dimensionality reduction, may rely on some latent variable generative model, or on purely algorithmic approaches.
### Latent Variable Generative Approaches
#### Finite Mixture {#finite-mixture}
```{example, label="males-females"}
Consider the distribution of heights.
Heights have a nice bell shaped distribution within each gender.
If genders have not been recorded, heights will be distributed like a _mixture_ of males and females.
The gender in this example is a _latent_ variable taking $K=2$ levels: male and female.
```
A _finite mixture_ is the marginal distribution of $K$ distinct classes, when the class variable is _latent_.
This is useful for clustering:
If we know the distribution of the sub-populations being mixed, we may simply cluster each data point to the most likely sub-group.
Learning how each sub-population is distributed when no class labels are available is no easy task, but it is possible.
For instance, by means of maximum likelihood.
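A minimal simulation sketch of Example \@ref(exm:males-females), assuming the `mclust` package is installed: fit a two-component Gaussian mixture by maximum likelihood (via EM), then cluster each point to its most likely component. The means and standard deviations below are arbitrary illustrative values.
```{r finite-mixture-sketch, eval=FALSE}
set.seed(1)
heights <- c(rnorm(200, mean = 162, sd = 6),  # "females"
             rnorm(200, mean = 176, sd = 7))  # "males"
mix <- mclust::Mclust(heights, G = 2)         # 2-component Gaussian mixture, fitted by EM
mix$parameters$mean                           # recovered component means
head(mix$classification)                      # most likely component for each observation
```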
### Purely Algorithmic Approaches
#### K-Means
The _K-means_ algorithm is possibly the most popular clustering algorithm.
The goal behind K-means clustering is finding a representative point for each of K clusters, and assigning each data point to one of these clusters.
As each cluster has a representative point, this is also known as a _prototype method_.
The clusters are defined so that they minimize the average Euclidean distance between all points and the center of their cluster.
In K-means, the clusters are first defined, and then similarities computed.
This is thus a _top-down_ method.
K-means clustering requires the raw features $X$ as inputs, and not only a similarity graph, $\mathcal{G}$.
This is evident when examining the algorithm below.
The k-means algorithm works as follows:
1. Choose the number of clusters $K$.
1. Arbitrarily assign points to clusters.
1. While clusters keep changing:
1. Compute the cluster centers as the average of their points.
1. Assign each point to its closest cluster center (in Euclidean distance).
1. Return cluster assignments and means.
\bigskip
```{remark}
If trained as a statistician, you may wonder- what population quantity is K-means actually estimating?
The estimand of K-means is known as the _K principal points_.
Principal points are points which are _self consistent_, i.e., they are the mean of their neighbourhood.
```
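For intuition only, here is a minimal hand-rolled sketch of the two alternating steps above on the scaled crime data (the R section below uses `stats::kmeans`, which should be preferred in practice; this sketch assumes no cluster empties out along the way).
```{r naive-kmeans}
K <- 2
set.seed(1)
cl <- sample(1:K, nrow(USArrests.1), replace = TRUE)  # arbitrary initial assignment
for (iter in 1:10) {                                  # iterate a fixed number of times, for brevity
  centers <- apply(USArrests.1, 2, function(col) tapply(col, cl, mean))  # cluster centers (K x p)
  d <- as.matrix(dist(rbind(centers, USArrests.1)))[-(1:K), 1:K]         # point-to-center distances
  cl <- apply(d, 1, which.min)                                           # reassign to closest center
}
table(cl)
```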
#### K-Means++
_K-means++_ is a fast version of K-means thanks to a smart initialization.
#### K-Medoids
If Euclidean distance is inappropriate for a particular set of variables, or robustness to corrupt observations is required, or we wish to constrain the cluster centers to be actual observations, then the _K-Medoids_ algorithm is an adaptation of K-means that allows this.
It is also known under the name _partition around medoids_ (PAM) clustering, suggesting its relation to _graph partitioning_: a very important and well-studied problem in computer science.
The k-medoids algorithm works as follows.
1. Given a dissimilarity graph, $\mathcal{G}$.
1. Choose the number of clusters $K$.
1. Arbitrarily assign points to clusters.
1. While clusters keep changing:
1. Within each cluster, set the center as the data point that minimizes the sum of distances to other points in the cluster.
1. Assign each point to its closest cluster center.
1. Return Cluster assignments and centers.
\bigskip
```{remark}
If trained as a statistician, you may wonder- what population quantity is K-medoids actually estimating?
The estimand of K-medoids is a set of points, each being the median of its neighbourhood.
A delicate matter is that quantiles are not easy to define for __multivariate__ variables, so the "multivariate median" may be a more subtle quantity than you may think.
See @small1990survey.
```
#### Hierarchical Clustering
Hierarchical clustering algorithms take dissimilarity graphs as inputs.
Hierarchical clustering is a class of greedy _graph-partitioning_ algorithms.
Being hierarchical by design, they have the attractive property that the evolution of the clustering can be presented with a _dendrogram_, i.e., a tree plot. Another advantage of these methods is that they do not require an a-priori choice of the number of clusters ($K$).
Two main sub-classes of algorithms are _agglomerative_, and _divisive_.
_Agglomerative clustering_ algorithms are __bottom-up__ algorithms which build clusters by joining smaller clusters.
To decide which clusters are joined at each iteration some measure of distance between clusters is required:
- __Single Linkage__:
Cluster distance is defined by the distance between the two __closest__ members.
- __Complete Linkage__:
Cluster distance is defined by the distance between the two __farthest__ members.
- __Group Average__:
Cluster distance is defined by the __average__ distance between members.
- __Group Median__:
Like Group Average, only using the median.
_Divisive clustering_ algorithms are __top-down__ algorithms which build clusters by splitting larger clusters.
### Clustering in R
#### K-Means
The following code is an adaptation from [David Hitchcock](http://people.stat.sc.edu/Hitchcock/chapter6_R_examples.txt).
```{r kmeans, cache=TRUE}
k <- 2
kmeans.1 <- stats::kmeans(USArrests.1, centers = k)
head(kmeans.1$cluster) # cluster asignments
pairs(USArrests.1, panel=function(x,y) text(x,y,kmeans.1$cluster))
```
Things to note:
- The `stats::kmeans` function does the clustering.
- The cluster assignment is given in the `cluster` element of the `stats::kmeans` output.
- The visual inspection confirms that similar states have been assigned to the same cluster.
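Because the K-means solution depends on the random initialization, it is common to use several random restarts and keep the best one; a minimal sketch using the `nstart` argument of `stats::kmeans`:
```{r kmeans-nstart}
kmeans.2 <- stats::kmeans(USArrests.1, centers = 2, nstart = 20)  # 20 random restarts; the best is kept
kmeans.2$tot.withinss  # total within-cluster sum of squares of the retained solution
```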
#### K-Medoids
Start by growing a distance graph with `dist` and then partition using `pam`.
```{r kmedoids, cache=TRUE}
state.disimilarity <- dist(USArrests.1)
kmed.1 <- cluster::pam(x= state.disimilarity, k=2)
head(kmed.1$clustering)
plot(pca.1$x[,1], pca.1$x[,2], xlab="PC 1", ylab="PC 2", type ='n', lwd=2)
text(pca.1$x[,1], pca.1$x[,2], labels=rownames(USArrests.1), cex=0.7, lwd=2, col=kmed.1$cluster)
```
Things to note:
- K-medoids starts with the computation of a dissimilarity graph, done by the `dist` function.
- The clustering is done by the `cluster::pam` function.
- Inspecting the output confirms that similar states have been assigned to the same cluster.
- Many other similarity measures can be found in `proxy::dist()`.
- See `cluster::clara()` for a big-data implementation of PAM.
#### Hierarchical Clustering
We start with agglomerative clustering with single-linkage.
```{r HirarchialClustering}
hirar.1 <- stats::hclust(state.disimilarity, method='single')
plot(hirar.1, labels=rownames(USArrests.1), ylab="Distance")
```
Things to note:
- The clustering is done with the `hclust` function.
- We choose the single-linkage distance using the `method='single'` argument.
- We did not need to a-priori specify the number of clusters, $K$, since all the possible $K$'s are included in the output tree.
- The `plot` function has a particular method for `hclust` class objects, and plots them as dendrograms.
We now try other types of linkage, to verify that they indeed affect the clustering.
Starting with complete linkage.
```{r complete linkage}
hirar.2 <- hclust(state.disimilarity, method='complete')
plot(hirar.2, labels=rownames(USArrests.1), ylab="Distance")
```
Now with average linkage.
```{r average linkage}
hirar.3 <- hclust(state.disimilarity, method='average')
plot(hirar.3, labels=rownames(USArrests.1), ylab="Distance")
```
If we know how many clusters we want, we can use `stats::cutree` to get the class assignments.
```{r}
cut.2.2 <- cutree(hirar.2, k=2)
head(cut.2.2)
```
How to choose the number of clusters?
Just like the Scree Plot above, we can use a "knee heuristic".
Because the length of a tree's branch is proportional to distances, long branches imply inhomogeneous groups, while short branches imply homogeneous groups.
Here is a little simulation to demonstrate this:
```{r}
n.groups <- 3 # set the number of groups
data.p <- 10 # set the dimension of the data
data.n <- 100 # set the number of samples
# data with no separation between groups
the.data.10 <- mvtnorm::rmvnorm(n = data.n, mean = rep(0,data.p))
data.disimilarity.10 <- dist(the.data.10)
hirar.10 <- hclust(data.disimilarity.10, method = "complete")
plot(hirar.10, ylab="Distance", main='All from the same group')
# data with strong separation between groups
the.data.11 <-the.data.10+sample(c(0,10,20), data.n, replace=TRUE) # Shift each group
data.disimilarity.11 <- dist(the.data.11)
hirar.11 <- hclust(data.disimilarity.11, method = "complete")
plot(hirar.11, ylab="Distance", main=paste('Strong Separation Between',n.groups, 'Groups'))
```
## Bibliographic Notes
An excellent reference on multivariate analysis (exploratory and inference) is @izenman2008modern.
For some theory of PCA see my [Dimensionality Reduction Class Notes](https://github.com/johnros/dim_reduce/blob/master/dim_reduce.pdf) and references therein.
For a SUPERB, interactive, visual demonstration of dimensionality reduction, see [Christopher Olah's](http://colah.github.io/posts/2014-10-Visualizing-MNIST/) blog.
And also [Jay Alammar's](https://jalammar.github.io/illustrated-word2vec/).
For t-SNE see the creator's site: [Laurens van der Maaten](https://lvdmaaten.github.io/tsne/).
For an excellent book on kernel methods (RKHS) see @shawe2004kernel.
For more on everything, see @friedman2001elements.
For a softer introduction (to everything), see @james2013introduction.
## Practice Yourself
1. Generate multivariate Gaussian data with `mvtnorm::rmvnorm()`. Clearly this data has no structure at all: it is a $p$-dimensional shapeless cloud of $n$ points.
1. Now try various dimensionality reduction algorithms such as PCA, MDS, kPCA, sPCA. How does the sphere map to the plane? How does the mapping depend on $n$? And on $p$?
1. Map the data to a $p$-dimensional unit sphere by dividing each observation by its $l_2$ norm: `map2sphere <- function(x) x/sqrt(sum(x^2))`. Repeat the previous embeddings. Does this structureless data embed itself with structure?
1. Introduce artificial "structure" in the data and repeat the previous exercise. Use the Factor Analysis generative model in Eq. \@ref(eq:factor) to generate $p$ dimensional data along a one-dimensional line. Can you see that observations arrange themselves along a single line after your plane embedding?
1. Read about the Iris dataset using `?iris`. "Forget" the `Species` column to make the problem unsupervised.
1. Make pairs of scatter plots. Can you identify the clusters in the data?
1. Perform K-means with `centers=3`. To extract the clustering results (the cluster of each instance), use the `cluster` element of the `kmeans` output. Now recall the `Species` column to verify your clustering.
1. Perform hierarchical clustering with `hclust`, `method="single"` and `method="average"`. Extract the clustering results with `cutree`. Compare the accuracy of the two linkage methods.
1. Perform PCA on the data with `prcomp` function.
1. Print the Rotation matrix.
1. Print the PCA’s vectors with `pca$x`. These vectors are the new values for each instance in the dataset after the rotation.
1. Let's look at the first component (PC1) with `plot(pca$x[,1])` (i.e., reduce the dimensionality from 4 to 1 feature). Can you visually identify the three clusters (species)?
1. Determine the color of the points to be the true species with `col=iris$Species`.
See DataCamp's
[Unsupervised Learning in R](https://www.datacamp.com/courses/unsupervised-learning-in-r),
[Cluster Analysis in R](https://www.datacamp.com/courses/cluster-analysis-in-r),
[Dimensionality Reduction in R](https://www.datacamp.com/courses/dimensionality-reduction-in-r), and
[Advanced Dimensionality Reduction in R](https://www.datacamp.com/courses/advanced-dimensionality-reduction-in-r)
for more self practice.