---
output: html_document
editor_options:
chunk_output_type: console
---
# Unsupervised Learning {#unsupervised}
This chapter deals with machine learning problems which are unsupervised.
This means the machine has access to a set of inputs, $x$, but the desired outcome, $y$, is not available.
Clearly, learning a relation between inputs and outcomes is impossible, but there are still a lot of problems of interest.
In particular, we may want to find a compact representation of the inputs, be it for visualization or further processing.
This is the problem of _dimensionality reduction_.
For the same reasons we may want to group similar inputs. This is the problem of _clustering_.
In the statistical terminology, with some exceptions, this chapter can be thought of as multivariate __exploratory__ statistics.
For multivariate __inference__, see Chapter \@ref(multivariate).
## Dimensionality Reduction {#dim-reduce}
```{example, label='bmi'}
Consider the heights and weights of a sample of individuals.
The data may seemingly reside in $2$ dimensions, but given the height, we have a pretty good guess of a person's weight, and vice versa.
We can thus state that heights and weights are not really two dimensional, but roughly lie on a $1$ dimensional subspace of $\mathbb{R}^2$.
```
```{example, label='iq'}
Consider the correctness of the answers to a questionnaire with $p$ questions.
The data may seemingly reside in a $p$ dimensional space, but if there is such a thing as "skill", then given the correctness of a person's replies to a subset of questions, we have a good idea of how they score on the rest.
If skill is indeed a one dimensional quality, then the questionnaire data should organize around a single line in the $p$ dimensional cube.
```
```{example, label='blind-signal'}
Consider $n$ microphones recording an individual.
The digitized recording consists of $p$ samples.
Are the recordings really a shapeless cloud of $n$ points in $\mathbb{R}^p$?
Since they all record the same sound, one would expect the $n$ $p$-dimensional points to arrange around the original, noiseless, sound: a single point in $\mathbb{R}^p$.
If the microphones are at different distances from the source, their volumes and echoes will differ.
We would thus expect the $n$ points to arrange about a __line__ in $\mathbb{R}^p$.
```
### Principal Component Analysis {#pca}
_Principal Component Analysis_ (PCA) is such a basic technique, it has been rediscovered and renamed independently in many fields.
It can be found under the names of
Discrete Karhunen–Loève Transform; Hotelling Transform; Proper Orthogonal Decomposition; Eckart–Young Theorem; Schmidt–Mirsky Theorem; Empirical Orthogonal Functions; Empirical Eigenfunction Decomposition; Empirical Component Analysis; Quasi-Harmonic Modes; Spectral Decomposition; Empirical Modal Analysis, and possibly more^[http://en.wikipedia.org/wiki/Principal_component_analysis].
The many names are quite interesting as they offer an insight into the different problems that led to PCA's (re)discovery.
Return to the BMI problem in Example \@ref(exm:bmi).
Assume you wish to give each individual a "size score".
Also assume this score is a __linear__ combination of height and weight.
That is the problem solved by PCA:
It returns the linear combination that has the largest variability, i.e., the combination which best distinguishes between individuals.
The variance maximizing motivation above was the one that guided @hotelling1933analysis.
But $30$ years before him, @pearson1901liii derived the same procedure with a different motivation in mind.
Pearson was also trying to give each individual a score.
He did not care about variance maximization, however.
He simply wanted a small set of coordinates in some (linear) space that approximates the original data well.
Before we proceed, we give an example to fix ideas.
Consider the crime rate data in `USArrests`, which encodes reported murder events, assaults, rapes, and the urban population of each American state.
```{r}
head(USArrests)
```
Following Hotelling's motivation, we may want to give each state a "criminality score".
We first remove the `UrbanPop` variable, which does not encode crime levels.
We then z-score each variable with `base::scale()`, and call PCA for a sequence of $1,\dots,3$ criminality scores that best separate between states.
```{r, echo=FALSE}
library(magrittr)
```
```{r, cache=TRUE}
USArrests.1 <- USArrests[,-3] %>% scale
pca.1 <- prcomp(USArrests.1, scale = TRUE)
pca.1
```
Things to note and terminology:
- The score that best distinguishes between states should be indifferent to the __average__ of each variable.
We also don't want the score to be sensitive to the measurement scale.
Formally, we want the scores to be _affine invariant_.
We thus perform PCA in the z-score scale of each variable, obtained with the `scale` function.
- PCA is performed with the `stats::prcomp` function.
It returns the contribution (weight) of each original variable to the new criminality score. After rescaling, these weights are called the _loadings_.
Borrowing from the __Factor Analysis__ literature, the loadings may be called _rotations_, which is their name in the `stats::prcomp()` output. If you are confused between weights, loadings and rotations, see this [Cross Validated](https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another) entry.
- The number of possible scores is the same as the number of original variables in the data.
- The new scores are called the _principal components_, labeled `PC1`,...,`PC3` in our output. They are computed by summing the original variables weighted by their loadings.
- The loadings/rotation on PC1 tell us that the best separation between states is along the average crime rate.
Why is this?
Because all the $3$ crime variables have a similar loading on PC1.
- The other PCs are slightly harder to interpret, but it is an interesting exercise.
__If we now represent each state, not with its original $3$ crimes measurements variables, but only with the first $2$ PCs (for example), we have reduced the dimensionality of the data.__
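To see concretely how the scores relate to the loadings, here is a minimal sketch (assuming the `pca.1` object fitted above): each PC score is just the centered and scaled data multiplied by the corresponding loadings vector.
```{r}
# PC scores = (centered, scaled) data %*% rotation matrix.
# USArrests.1 was already z-scored, so the centering/scaling below is essentially a no-op.
scores.manual <- scale(USArrests.1, center = pca.1$center, scale = pca.1$scale) %*% pca.1$rotation
head(scores.manual)
head(pca.1$x)  # prcomp's own scores; should agree with the manual computation
```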
#### Mathematics of PCA
What is the mathematical problem that is actually solved with PCA?
Finding a linear combination ($v$) of the original variables ($x$), so that the new score/index ($v'x$) best separates individuals.
Best separation implies that the variance of $v'x$ is maximal.
Clearly, $Var[v'x]$ may explode if any $v$ is allowed, so we need to pick $v$ from some "fair" set.
It is most convenient, mathematically, to constrain the $l_2$ norm to some constant: $\Vert v \Vert^2_2=\sum v_j^2=1$.
The first "best separating score", known as the first _principal component_ (PC), is thus
$$v_1'x \quad s.t. \quad v_1=argmax_{v}\{Var[v'x], \text{ and } \Vert v \Vert_2=1 \} .$$
The second PC, is the same, only that it is required to be orthogonal to the first PC:
$$v_2'x \quad s.t. \quad v_2=argmax_{v}\{Var[v'x], \text{ and } \Vert v \Vert_2=1, \text{ and } v'v_1=0 \} .$$
The construction of the next PCs follows the same lines: find a linear transformation of the data that best separates observations and is orthogonal to the previous PCs.
#### How Hard is the PCA Problem?
Estimating all the PCs in the data is well defined algebraically if $n>p$, in which case $p$ PCs are computable.
This is the algebraic part of the problem, which is rather easy, and solved with [SVD](https://en.wikipedia.org/wiki/Singular-value_decomposition).
If we view PCA as an inference tool, we may ask about its statistical performance.
It turns out that PCA has the same statistical difficulty as estimating a covariance matrix.
As we already saw in the Multivariate Statistics Chapter (\@ref(multivariate)), estimating covariances is a hard task. We thus recommend: don't trust your PCs if $n$ is not much larger than $p$, and see the bibliographic notes for further details.
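As a sketch of the algebraic part only, the PCs can be read off the SVD of the centered and scaled data matrix: the right singular vectors are the loadings, and $UD$ gives the scores, each column defined only up to its sign.
```{r}
# PCA via SVD: V holds the loadings, U %*% diag(d) the scores (up to column signs).
sv <- svd(scale(USArrests.1))  # the data was already z-scored, so scale() changes nothing here
sv$v                           # compare with pca.1$rotation
head(sv$u %*% diag(sv$d))      # compare with pca.1$x
```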
### Dimensionality Reduction Preliminaries
Before presenting methods other than PCA, we need some terminology.
- __Variable__:
A.k.a. _dimension_, or _feature_, or _column_. One of the $p$ measurements, in its raw scale.
- __Feature Mapping__:
A.k.a. _variable transformation_, or _data augmentation_. A measurement in a new, transformed, scale.
- __Data__:
A.k.a. _sample_, _observations_.
Will typically consist of $n$ vectors of dimension $p$.
We typically denote the data as an $n\times p$ matrix $X$.
- __Data space__:
A.k.a. _feature space_. The space of all possible values of $X$. We denote it by $\mathcal{X}$.
- __Network__:
A representation of the similarities (or dissimilarities) between the $n$ units in the data. We denote it by $\mathcal{G}$; it may be encoded in an $n \times n$ matrix.
- __Manifold__:
A generalization of a linear space, which is regular enough so that, __locally__, it has all the properties of a linear space.
We will denote an arbitrary manifold by $\mathcal{M}$, and by $\mathcal{M}_q$ a $q$ dimensional^[You are probably used to thinking of the __dimension__ of linear spaces. We will not rigorously define what the dimension of a manifold is, but you may think of it as the number of free coordinates needed to navigate along it.] manifold.
- __Embedding__:
Informally speaking: a "shape preserving" mapping (see figure below). We denote an embedding of the data $X$, into a manifold $\mathcal{M}$ by $X\mapsto \mathcal{M}$.
- __Embedding Function__:
If the embedding is not only an algorithm, but rather, has a functional form representation, we call it an _embedding function_ $f$. Given such a function, we are not restricted to embeddings of the original data, $X$, but may also embed new data points from $\mathcal{X}$: $f:\mathcal{X}\mapsto\mathcal{M}$.
- __Generative Model__:
Known to statisticians as the __sampling distribution__.
The assumed stochastic process that generated the observed $X$.
![Various embedding algorithms. No embedding of the sphere into the plane is perfect. This is obviously not new: map makers have known it for centuries!](art/sphx_glr_plot_manifold_sphere_001.png)
There are many motivations for dimensionality reduction:
1. __Scoring__:
Give each observation an interpretable, simple score (Hotelling's motivation).
1. __Latent structure__:
Recover unobservable information from indirect measurements.
E.g: Blind signal reconstruction, CT scan, cryo-electron microscopy, etc.
1. __Signal to Noise__:
Denoise measurements before further processing like clustering, supervised learning, etc.
1. __Compression__:
Save on RAM, CPU, and communication when operating on a lower dimensional representation of the data.
### Latent Variable Generative Approaches
All generative approaches to dimensionality reduction will include a set of latent/unobservable variables, which we can try to recover from the observables $X$.
The unobservable variables will typically have a lower dimension than the observables, thus, dimension is reduced.
We start with the simplest case of linear Factor Analysis.
#### Factor Analysis (FA) {#factor-analysis}
FA originates from the psychometric literature.
We thus revisit the IQ (actually g-factor^[https://en.wikipedia.org/wiki/G_factor_(psychometrics)]) Example \@ref(exm:iq):
```{example}
Assume $n$ respondents answer $p$ quantitative questions: $x_i \in \mathbb{R}^p, i=1,\dots,n$.
Also assume their responses are some linear function of a single personality attribute, $s_i$.
We can think of $s_i$ as the subject's "intelligence".
We thus have
\begin{align}
x_i = s_i A + \varepsilon_i
\end{align}
And in matrix notation:
\begin{align}
X = S A+\varepsilon,
(\#eq:factor)
\end{align}
where $A$ is the $q \times p$ matrix of factor loadings, and $S$ the $n \times q$ matrix of latent personality traits.
In our particular example where $q=1$, the problem is to recover the unobservable intelligence scores, $s_1,\dots,s_n$, from the observed answers $X$.
```
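To fix ideas, here is a minimal simulation sketch of the generative model in Eq. \@ref(eq:factor) with a single latent trait ($q=1$); the sample size, loadings, and noise level below are arbitrary choices for illustration only.
```{r}
set.seed(1)
n <- 100; p <- 6; q <- 1
S <- matrix(rnorm(n * q), n, q)              # latent traits ("intelligence")
A <- matrix(runif(q * p, 0.5, 1.5), q, p)    # factor loadings
eps <- matrix(rnorm(n * p, sd = 0.3), n, p)  # measurement noise
X <- S %*% A + eps                           # observed answers
round(cor(X), 2)                             # the single factor induces strong correlations
```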
We may try to estimate $S A$ by assuming some distribution on $S$ and $\varepsilon$ and apply maximum likelihood.
Under standard assumptions on the distribution of $S$ and $\varepsilon$, recovering $S$ from $\widehat{S A }$ is still impossible as there are infinitely many such solutions.
In the statistical parlance we say the problem is _non identifiable_, and in the applied mathematics parlance we say the problem is _ill posed_.
```{remark}
The non-uniqueness (non-identifiability) of the FA solution under variable rotation is never mentioned in the PCA context.
Why is this?
This is because PCA and FA solve different problems.
$\hat S$ in PCA is well defined because PCA does not seek a single $S$ but rather a __sequence__ of $S_q$ with dimensions growing from $q=1$ to $q=p$.
```
The FA terminology is slightly different from that of PCA:
- __Factors__:
The unobserved attributes $S$.
Akin to the _principal components_ in PCA.
- __Loading__:
The $A$ matrix; the contribution of each factor to the observed $X$.
- __Rotation__:
An arbitrary orthogonal re-combination of the factors, $S$, and loadings, $A$, which changes the interpretation of the result.
The FA literature offers several heuristics to "fix" the identifiability problem of FA.
These are known as _rotations_, and go under the names of _Varimax_, _Quartimax_, _Equimax_, _Oblimin_, _Promax_, and possibly others.
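For instance, `stats::varimax()` applies the Varimax criterion to any loadings matrix; a minimal sketch on an arbitrary (made-up) $3 \times 2$ loadings matrix:
```{r varimax-sketch}
# An arbitrary loadings matrix, for illustration only.
A0 <- matrix(c(0.7, 0.6, 0.5,
               0.5, -0.4, 0.6), ncol = 2)
stats::varimax(A0)$loadings  # the Varimax-rotated loadings
```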
Because of their great similarity, FA is often confused with PCA.
For a discussion of the similarities and dissimilarities, see this excellent [StackExchange Q](https://stats.stackexchange.com/questions/123063/is-there-any-good-reason-to-use-pca-instead-of-efa-also-can-pca-be-a-substitut).
#### Independent Component Analysis (ICA)
Like FA, _independent component analysis_ (ICA) is a family of latent space models, thus, a _meta-method_.
It assumes the observables are some function of the latent variables $S$.
In many cases this function is assumed to be linear in $S$ so that ICA is compared, if not confused, with PCA and even more so with FA.
The fundamental idea of ICA is that $S$ has a joint distribution of __non-Gaussian__, __independent__ variables.
This independence assumption solves the non-uniqueness of $S$ in FA.
As such, it can be thought of as a type of rotation in FA.
Then again, if the assumed distribution of $S$ is both non-Gaussian and well justified, then ICA is well defined, and more than just an arbitrary rotation of FA.
Being a generative model, estimation of $S$ can then be done using maximum likelihood, or other estimation principles.
ICA is a popular technique in signal processing, where $A$ is actually the signal, such as sound in Example \@ref(exm:blind-signal).
Recovering $A$ thus amounts to recovering the original signals mixed into the recorded $X$.
### Purely Algorithmic Approaches
We now discuss dimensionality reduction approaches that are not stated via their generative model, but rather, directly as an algorithm.
This does not mean that they cannot be cast via their generative model, but rather they were not motivated as such.
#### Multidimensional Scaling (MDS)
MDS can be thought of as a variation on PCA, that begins with the $n \times n$ graph of distances between data points $\mathcal{G}$; in contrast to PCA which operates on the original $n \times p$ data $X$.
The term _graph_ is typically used in this context, but saying _network_ instead of _graph_ is more accurate. This is because a graph encodes connections (topology) and networks encode distances (geometry). Put differently, a graph can be encoded in a matrix of zeroes and ones, and a network in a matrix of real numbers.
MDS aims at embedding $\mathcal{G}$ into the plane, typically for visualization, while preserving the original distances.
Basic results in graph/network theory suggest that the geometry of a graph cannot be preserved when embedding it into lower dimensions [@graham1988isometric].
The different types of MDS, such as _Classical MDS_ and _Sammon Mapping_, differ in the _stress function_ that penalizes the geometric distortion caused by the embedding.
#### Local Multidimensional Scaling (local-MDS) {#loc-mds}
```{example, label='non-euclidean'}
Consider data, $X$, of Cartesian coordinates on the globe.
At short distances, constructing a dissimilarity graph, $X \mapsto \mathcal{G}$ using Euclidean distances between Cartesian coordinates will capture the true distance between points.
At long distances, however, Euclidean distances are a very poor approximation of the distance one has to travel between points on the globe.
A more extreme example is coordinates on the brain's cerebral cortex.
Being a highly folded surface, the Euclidean distance between points is far from the true geodesic distance along the cortex's surface^[Then again, it is possible that the true distances are along the white-matter fibers connecting regions beneath the cortex, in which case Euclidean distances are more appropriate than geodesic distances. We put that aside for now.].
```
local-MDS is aimed at solving the case where Euclidean distances, implied by PCA and FA, are a bad measure of distance.
Instead of using the graph of Euclidean distances between all pairs of points, local-MDS computes $\mathcal{G}$ starting with the Euclidean distances between pairs of nearest points only.
Longer distances are solved as a [shortest path problem](https://en.wikipedia.org/wiki/Shortest_path_problem).
For instance, using the [Floyd–Warshall algorithm](https://en.wikipedia.org/wiki/Floyd–Warshall_algorithm), which sums distances between closest points.
This is akin to computing the distance from Jerusalem to Beijing by summing Euclidean distances between Jerusalem-Baghdad, Baghdad-Tehran, Tehran-Ashgabat, Ashgabat-Tashkent, and so on. Because the [geographical distance](https://en.wikipedia.org/wiki/Geographical_distance) between nearby cities is well approximated by the Euclidean distance, summing local distances is better than operating directly on the Euclidean distance between Jerusalem and Beijing.
After computing $\mathcal{G}$, local-MDS ends with the usual MDS for the embedding.
Because local-MDS ends with a regular MDS, it can be seen as a non-linear embedding into a linear manifold $\mathcal{M}$.
#### Isometric Feature Mapping (IsoMap) {#isomap}
Like local-MDS, only that the embedding, and not only the computation of the distances, is local.
#### Local Linear Embedding (LLE)
Very similar to IsoMap (\@ref(isomap)).
#### t-SNE
t-SNE is a recently popularized visualization method for high-dimensional data.
t-SNE starts by computing a proximity graph, $\mathcal{G}$.
Computation of distances in the graph assumes a Gaussian decay of distances.
Put differently: only the nearest observations have a non-vanishing similarity.
This stage is similar (in spirit) to the growing of $\mathcal{G}$ in local-MDS (\@ref(loc-mds)).
The second stage in t-SNE consists of finding a mapping to 2D (or 3D), which conserves distances in $\mathcal{G}$.
The uniqueness of t-SNE compared to other embeddings is in the way distances are computed in the target 2D (or 3D) space.
#### UMAP
See [Understanding UMAP](https://pair-code.github.io/understanding-umap/).
#### Force Directed Graph Drawing
This class of algorithms starts with a proximity graph $\mathcal{G}$, and defines a set of physically motivated "forces" operating between data points.
Think of $\mathcal{G}$ as governing a set of springs between data points.
These springs have some steady-state.
The location of points in the embedding corresponds to the steady state of this system of springs.
#### Kernel PCA (kPCA)
Returning to the BMI example (\@ref(exm:bmi)): what if we want to learn scores that best separate between individuals but, unlike PCA, are non-linear in the original features?
Kernel PCA does just that, only that it restricts the possible scores to simple functions of the original variables.
These functions are known as the _feature mapping_.
The feature mappings reside in a function space called a Reproducing Kernel Hilbert Space (RKHS), thus giving kPCA its name.
By defining a _Kernel_ one defines the class of feature mappings implied by the algorithm.
The magic of kPCA, like other kernel methods, is that even if one chooses a kernel that maps the $p$ features to an infinite dimensional space, the kPCA problem still has a closed-form solution.
This implies that theoreticians may study the statistical properties of kPCA, and that solutions do not require solving optimization problems in intractable function spaces.
#### Sparse PCA (sPCA)
sPCA is a type of PCA where the loadings are sparse.
This means that PCs are linear combinations of a small number of variables.
This makes sPCA easier to interpret.
Note that the _varimax_ rotation in factor-analysis (\@ref(factor-analysis)) has a similar goal: create factors with the smallest number of contributing variables, so that they are easy to explain.
#### Sparse kernel PCA (skPCA)
A marriage between sPCA and kPCA: generate scores that are each a non-linear transformation of a small number of variables.
#### Correspondence Analysis (CA)
What if $x$ is not continuous, i.e., $\mathcal{X}\neq \mathbb{R}^p$?
We could dummy-code $x$, and then use plain PCA.
A more principled view, when $x$ is categorical, is known as _Correspondence Analysis_.
### Dimensionality Reduction in R
#### PCA {#pca-in-r}
We already saw the basics of PCA in \@ref(pca).
The fitting is done with the `stats::prcomp` function.
The _bi-plot_ is a useful way to visualize the output of PCA.
```{r}
# library(devtools)
# install_github("vqv/ggbiplot")
ggbiplot::ggbiplot(pca.1)
```
Things to note:
- We used the `ggbiplot::ggbiplot` function (available from github, but not from CRAN), because it has a nicer output than `stats::biplot`.
- The data is presented in the plane of PC1 and PC2.
- The bi-plot plots the loadings as arrows. The coordinates of the arrows correspond to the weights of the original variables in each PC.
For example, the x-value of each arrow is its loading on PC1.
Since the weights of Murder, Assault, and Rape are almost the same, we conclude that PC1 captures the average crime rate in each state.
The _scree plot_ depicts the quality of the approximation of $X$ as $q$ grows, i.e., as we increase the dimension of our new score.
This is depicted using the proportion of variability in $X$ that is captured by each added PC.
It is customary to choose $q$ as the first PC that has a relatively low contribution to the approximation of $X$.
This is known as the "knee heuristic".
```{r scree}
ggbiplot::ggscreeplot(pca.1)
```
The scree plot suggests that PC1 alone captures about 0.8 of the variability in crime levels.
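The same proportions can be read directly off the `prcomp` output; a minimal sketch:
```{r}
summary(pca.1)                    # standard deviations and proportion of variance per PC
pca.1$sdev^2 / sum(pca.1$sdev^2)  # the scree plot values, computed by hand
```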
The next plot is the classical classroom introduction to PCA.
It shows that states are indeed arranged along a single line in the "Assault-Murder" plane. This line is PC1.
```{r, echo=FALSE}
USArrests.1 <- USArrests[,-3] %>% scale
load <- pca.1$rotation
slope <- load[2, ]/load[1, ]
mn <- apply(USArrests.1, 2, mean)
intcpt <- mn[2] - (slope * mn[1])
# scatter plot with the two new axes added
USArrests.2 <- USArrests[,1:2] %>% scale
xlim <- range(USArrests.2) # overall min, max
plot(USArrests.2, xlim = xlim, ylim = xlim, pch = 16, col = "purple") # both axes same length
abline(intcpt[1], slope[1], lwd = 2) # first component solid line
abline(intcpt[2], slope[2], lwd = 2, lty = 2) # second component dashed
legend("right", legend = c("PC 1", "PC 2"), lty = c(1, 2), lwd = 2, cex = 1)
# projections of points onto PCA 1
y1 <- intcpt[1] + slope[1] * USArrests.2[, 1]
x1 <- (USArrests.1[, 2] - intcpt[1])/slope[1]
y2 <- (y1 + USArrests.1[, 2])/2
x2 <- (x1 + USArrests.1[, 1])/2
segments(USArrests.1[, 1], USArrests.1[, 2], x2, y2, lwd = 2, col = "purple")
```
More implementations of PCA:
```{r many PCA implementations, eval=FALSE}
# FAST solutions:
gmodels::fast.prcomp()
# More detail in output:
FactoMineR::PCA()
# For flexibility in algorithms and visualization:
ade4::dudi.pca()
# Another one...
amap::acp()
```
#### FA
```{r FA, cache=TRUE}
fa.1 <- psych::principal(USArrests.1, nfactors = 2, rotate = "none")
fa.1
biplot(fa.1, labels = rownames(USArrests.1))
```
Numeric comparison with PCA:
```{r}
fa.1$loadings
pca.1$rotation
```
Things to note:
- We perform FA with the `psych::principal` function. The `Principal Component Analysis` title is due to the fact that FA without rotations is equivalent to PCA.
- The first factor (`fa.1$loadings`) has different weights than the first PC (`pca.1$rotation`) because they have different normalizations. They have the same interpretation, however: all weights of the first component are similar, suggesting it merely captures the average crime rate.
Graphical model fans will like the following plot, where the contribution of each variable to each factor is encoded in the width of the arrow.
```{r, eval=TRUE}
qgraph::qgraph(fa.1$loadings)
```
Let's add a rotation (Varimax), and note that the rotation has indeed changed the loadings of the variables, thus the interpretation of the factors.
```{r varimax, cache=TRUE}
fa.2 <- psych::principal(USArrests.1, nfactors = 2, rotate = "varimax")
fa.2$loadings
```
Things to note:
- FA with a rotation is no longer equivalent to PCA.
- The rotated factors are now called _rotated components_, and are reported as `RC1` and `RC2`.
#### ICA
```{r ICA, cache=TRUE}
ica.1 <- fastICA::fastICA(USArrests.1, n.comp = 2) # Also performs projection pursuit
plot(ica.1$S)
abline(h=0, v=0, lty=2)
text(ica.1$S, pos = 4, labels = rownames(USArrests.1))
# Compare with PCA (first two PCs):
arrows(x0 = ica.1$S[,1], y0 = ica.1$S[,2],
x1 = -pca.1$x[,2], y1 = pca.1$x[,1],
col='red', pch=19, cex=0.5)
```
Things to note:
- We fit ICA with `fastICA::fastICA`.
- The ICA components, like any other rotated components, are different than the PCA components.
- The fastICA algorithm has a stochastic component, so the solution will differ between re-runs (making comparison to PCA harder).
#### MDS
Classical MDS compared to PCA.
```{r MDS, results='hold', cache=TRUE}
# We first need a dissimilarity matrix/graph:
state.disimilarity <- dist(USArrests.1)
mds.1 <- cmdscale(state.disimilarity)
plot(mds.1, pch = 19)
abline(h=0, v=0, lty=2)
USArrests.2 <- USArrests[,1:2] %>% scale
text(mds.1, pos = 4, labels = rownames(USArrests.2), col = 'tomato')
# Compare with PCA (first two PCs):
points(pca.1$x[,1:2], col='red', pch=19, cex=0.5)
```
Things to note:
- We first compute a dissimilarity graph with `stats::dist()`. See `cluster::daisy` for a wider variety of dissimilarity measures.
- We learn the MDS embedding with `stats::cmdscale`.
- Embedding with the first two components of PCA is exactly the same as classical MDS with Euclidean distances, up to a sign flip of each component (a quick numerical check follows below).
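Here is that check; a minimal sketch comparing absolute values of the two embeddings (absolute values, because each column is only defined up to a sign flip). It should return `TRUE`, or report only a tiny numeric difference.
```{r}
# Classical MDS on Euclidean distances reproduces the first two PC scores, up to column signs.
all.equal(abs(unname(mds.1)), abs(unname(pca.1$x[, 1:2])), tolerance = 1e-6)
```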
Let's try other stress functions for MDS, like Sammon's stress, and compare them with PCA.
```{r SammonMDS, results='hold'}
mds.2 <- MASS::sammon(state.disimilarity, trace = FALSE)
plot(mds.2$points, pch = 19)
abline(h=0, v=0, lty=2)
text(mds.2$points, pos = 4, labels = rownames(USArrests.2))
# Compare with PCA (first two PCs):
arrows(
x0 = mds.2$points[,1], y0 = mds.2$points[,2],
x1 = pca.1$x[,1], y1 = pca.1$x[,2],
col='red', pch=19, cex=0.5)
```
Things to note:
- `MASS::sammon` does the embedding.
- The Sammon embedding is different from the PCA embedding.
#### t-SNE
For a native R implementation: [tsne package](https://cran.r-project.org/web/packages/tsne/).
For an R wrapper for C libraries: [Rtsne package](https://github.com/jkrijthe/Rtsne).
The [MNIST](http://yann.lecun.com/exdb/mnist/) image bank of hand-written images has its own data format. The import process is adapted from [David Dalpiaz](https://gist.github.com/daviddalpiaz/ae62ae5ccd0bada4b9acd6dbc9008706):
```{r}
show_digit <- function(arr784, col = gray(12:1 / 12), ...) {
image(matrix(as.matrix(arr784[-785]), nrow = 28)[, 28:1], col = col, ...)
}
# load image files
load_image_file <- function(filename) {
ret <- list()
f <- file(filename, 'rb')
readBin(f, 'integer', n = 1, size = 4, endian = 'big')
n <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')
nrow <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')
ncol <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')
x <- readBin(f, 'integer', n = n * nrow * ncol, size = 1, signed = FALSE)
close(f)
data.frame(matrix(x, ncol = nrow * ncol, byrow = TRUE))
}
# load label files
load_label_file <- function(filename) {
f <- file(filename, 'rb')
readBin(f, 'integer', n = 1, size = 4, endian = 'big')
n <- readBin(f, 'integer', n = 1, size = 4, endian = 'big')
y <- readBin(f, 'integer', n = n, size = 1, signed = FALSE)
close(f)
y
}
# load images
train <- load_image_file("data/train-images-idx3-ubyte")
test <- load_image_file("data/t10k-images-idx3-ubyte")
# load labels
train$y = as.factor(load_label_file("data/train-labels-idx1-ubyte"))
test$y = as.factor(load_label_file("data/t10k-labels-idx1-ubyte"))
```
Inspect some digits:
```{r}
par(mfrow=c(3,3))
ind <- sample(1:nrow(train),9)
for(i in 1:9){
show_digit(train[ind[i],], main=paste('Label= ',train$y[ind[i]], sep='')) }
```
The analysis is adapted from [Shruti Marwaha](https://rpubs.com/marwahsi/tnse).
```{r}
numTrain <- 5e3 # Subset data for speed
rows <- sample(1:nrow(train), numTrain)
train.sub <- train[rows, -which(names(train)=='y')] %>% as.matrix
train.sub.labs <- train[rows, which(names(train)=='y')]
tsne <- Rtsne::Rtsne(train.sub, dims = 2, perplexity=30, verbose=FALSE, max_iter = 500)
colors <- rainbow(length(unique(train.sub.labs)))
names(colors) <- unique(train.sub.labs)
par(mgp=c(2.5,1,0))
par(mfrow=c(1,1))
plot(tsne$Y, t='n',
main="tSNE",
xlab="tSNE dimension 1",
ylab="tSNE dimension 2",
"cex.main"=2,
"cex.lab"=1.5)
text(tsne$Y, labels=train.sub.labs, col=colors[train.sub.labs])
```
#### Force Embedding
I am unaware of an R implementation of force-embedding.
Maybe because the interactive nature of the algorithm is not well suited to R.
Force embedding is much more natural in interactive GUIs.
Here is a link for a fun [javascript implementation](http://bl.ocks.org/eesur/be2abfb3155a38be4de4).
#### Sparse PCA
```{r sPCA}
# Compute a robust covariance matrix, used as the similarity graph
state.similarity <- MASS::cov.rob(USArrests.1)$cov
spca1 <- elasticnet::spca(state.similarity, K=2, type="Gram", sparse="penalty", trace=FALSE, para=c(0.06,0.16))
spca1$loadings
```
Things to note:
- I used the `elasticnet::spca` function.
- Are the solutions sparse? Yes! PC2 depends on a single variable only: `Murder`.
#### Kernel PCA
```{r kPCA, eval=TRUE}
library(kernlab)
kpc <- kpca(~.,data=as.data.frame(USArrests.1), kernel="rbfdot", kpar=list(sigma=0.2), features=2)
plot(rotated(kpc),
xlab="1st Principal Component",
ylab="2nd Principal Component")
abline(h=0, v=0, lty=2)
text(rotated(kpc), pos = 4, labels = rownames(USArrests.2))
```
Things to note:
- We used `kernlab::kpca` for kPCA.
- `rotated` projects the data on its principal components (the above "scores").
- See `?'kpca-class'` or `?rotated` for help on available utility functions.
- `kernel=` governs the class of feature mappings.
- `kpar=list(sigma=0.2)` provides parameters specific to each type of kernel. See `?kpca`.
- `features=2` is the number of principal components (scores) to learn.
- You may notice the "Horseshoe" pattern of the kPCA embedding, a.k.a. "Croissants", or the _Guttman Effect_. The horseshoe is a known phenomenon in low dimensional visualizations. See [J. De Leeuw's online paper](https://rpubs.com/deleeuw/133786) for more details.
#### Multiple Correspondence Analysis (MCA)
See @izenman2008modern.
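For a quick impression, here is a minimal sketch, assuming the `FactoMineR` package is installed: discretize the (scaled) crime variables into two categories each, and run MCA on the resulting factors. The discretization below is a hypothetical choice made only for illustration.
```{r MCA-sketch, eval=FALSE}
# Cut each crime variable into Low/High and run MCA on the resulting categorical data.
USArrests.cat <- as.data.frame(lapply(as.data.frame(USArrests.1),
                                      function(x) cut(x, breaks = 2, labels = c("Low", "High"))))
mca.1 <- FactoMineR::MCA(USArrests.cat, graph = FALSE)
head(mca.1$ind$coord)  # coordinates of the states on the MCA dimensions
```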
\pagebreak
## Clustering {#cluster}
```{example, label="photos"}
Consider the tagging of your friends' pictures on Facebook.
If you tagged some pictures, Facebook may try to use a supervised approach to automatically label photos.
If you never tagged pictures, a supervised approach is impossible.
It is still possible, however, to group similar pictures together.
```
```{example, label="spam"}
Consider the problem of spam detection.
It would be nice if each user could label several thousands emails, to apply a supervised learning approach to spam detection.
This is an unrealistic demand, so a _pre-clustering_ stage is useful: the user only needs to label a couple of dozen (hopefully homogeneous) clusters, before solving the supervised learning problem.
```
In clustering problems, we seek to group observations that are similar.
There are many motivations for clustering:
1. __Understanding__:
The most common use of clustering is probably as an exploratory step, to identify homogeneous groups in the data^[As such, you may wonder why clustering is part of machine learning at all? Maybe call it _machine-assisted human-learning_?].
1. __Dimensionality reduction__:
Clustering may be seen as a method for dimensionality reduction.
Unlike the approaches in the Dimensionality Reduction Section \@ref(dim-reduce), it does not compress __variables__ but rather __observations__.
Each group of homogeneous observations may then be represented as a single _prototypical observation_ of the group.
1. __Pre-Labelling__:
Clustering may be performed as a pre-processing step for supervised learning, when labeling all the samples is impossible due to "budget" constraints, like in Example \@ref(exm:spam). This is sometimes known as _pre-clustering_.
Clustering, like dimensionality reduction, may rely on some latent variable generative model, or on purely algorithmic approaches.
### Latent Variable Generative Approaches
#### Finite Mixture {#finite-mixture}
```{example, label="males-females"}
Consider the distribution of heights.
Heights have a nice bell shaped distribution within each gender.
If genders have not been recorded, heights will be distributed like a _mixture_ of males and females.
The gender in this example is a _latent_ variable taking $K=2$ levels: male and female.
```
A _finite mixture_ is the marginal distribution of $K$ distinct classes, when the class variable is _latent_.
This is useful for clustering:
If we know the distribution of the sub-populations being mixed, we may simply cluster each data point to the most likely sub-group.
Learning how each sub-population is distributed when no class labels are available is no easy task, but it is possible.
For instance, by means of maximum likelihood.
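A minimal simulation sketch of Example \@ref(exm:males-females), assuming the `mclust` package is installed: fit a two-component Gaussian mixture by maximum likelihood (via EM), then cluster each point to its most likely component. The means and standard deviations below are arbitrary illustrative values.
```{r finite-mixture-sketch, eval=FALSE}
set.seed(1)
heights <- c(rnorm(200, mean = 162, sd = 6),  # "females"
             rnorm(200, mean = 176, sd = 7))  # "males"
mix <- mclust::Mclust(heights, G = 2)         # 2-component Gaussian mixture, fitted by EM
mix$parameters$mean                           # recovered component means
head(mix$classification)                      # most likely component for each observation
```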
### Purely Algorithmic Approaches
#### K-Means
The _K-means_ algorithm is possibly the most popular clustering algorithm.
The goal behind K-means clustering is finding a representative point for each of K clusters, and assigning each data point to one of these clusters.
As each cluster has a representative point, this is also known as a _prototype method_.
The clusters are defined so that they minimize the average Euclidean distance between all points and the center of their cluster.
In K-means, the clusters are first defined, and then similarities computed.
This is thus a _top-down_ method.
K-means clustering requires the raw features $X$ as inputs, and not only a similarity graph, $\mathcal{G}$.
This is evident when examining the algorithm below.
The k-means algorithm works as follows:
1. Choose the number of clusters $K$.
1. Arbitrarily assign points to clusters.
1. While clusters keep changing:
1. Compute the cluster centers as the average of their points.
1. Assign each point to its closest cluster center (in Euclidean distance).
1. Return cluster assignments and means.
\bigskip
```{remark}
If trained as a statistician, you may wonder- what population quantity is K-means actually estimating?
The estimand of K-means is known as the _K principal points_.
Principal points are points which are _self consistent_, i.e., they are the mean of their neighbourhood.
```
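For intuition only, here is a minimal hand-rolled sketch of the two alternating steps above on the scaled crime data (the R section below uses `stats::kmeans`, which should be preferred in practice; this sketch assumes no cluster empties out along the way).
```{r naive-kmeans}
K <- 2
set.seed(1)
cl <- sample(1:K, nrow(USArrests.1), replace = TRUE)  # arbitrary initial assignment
for (iter in 1:10) {                                  # iterate a fixed number of times, for brevity
  centers <- apply(USArrests.1, 2, function(col) tapply(col, cl, mean))  # cluster centers (K x p)
  d <- as.matrix(dist(rbind(centers, USArrests.1)))[-(1:K), 1:K]         # point-to-center distances
  cl <- apply(d, 1, which.min)                                           # reassign to closest center
}
table(cl)
```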
#### K-Means++
_K-means++_ is a fast version of K-means thanks to a smart initialization.
#### K-Medoids
If Euclidean distance is inappropriate for a particular set of variables, or robustness to corrupt observations is required, or we wish to constrain the cluster centers to be actual observations, then the _K-Medoids_ algorithm is an adaptation of K-means that allows this.
It is also known under the name _partition around medoids_ (PAM) clustering, suggesting its relation to _graph partitioning_: a very important and well-studied problem in computer science.
The k-medoids algorithm works as follows.
1. Given a dissimilarity graph, $\mathcal{G}$.
1. Choose the number of clusters $K$.
1. Arbitrarily assign points to clusters.
1. While clusters keep changing:
1. Within each cluster, set the center as the data point that minimizes the sum of distances to other points in the cluster.
1. Assign each point to its closest cluster center.
1. Return Cluster assignments and centers.
\bigskip
```{remark}
If trained as a statistician, you may wonder- what population quantity is K-medoids actually estimating?
The estimand of K-medoids is a set of points, each being the median of its neighbourhood.
A delicate matter is that quantiles are not easy to define for __multivariate__ variables, so the "multivariate median" may be a more subtle quantity than you may think.
See @small1990survey.
```
#### Hierarchical Clustering
Hierarchical clustering algorithms take dissimilarity graphs as inputs.
Hierarchical clustering is a class of greedy _graph-partitioning_ algorithms.
Being hierarchical by design, they have the attractive property that the evolution of the clustering can be presented with a _dendrogram_, i.e., a tree plot. Another advantage of these methods is that they do not require an a-priori choice of the number of clusters ($K$).
Two main sub-classes of algorithms are _agglomerative_, and _divisive_.
_Agglomerative clustering_ algorithms are __bottom-up__ algorithms which build clusters by joining smaller clusters.
To decide which clusters are joined at each iteration some measure of distance between clusters is required:
- __Single Linkage__:
Cluster distance is defined by the distance between the two __closest__ members.
- __Complete Linkage__:
Cluster distance is defined by the distance between the two __farthest__ members.
- __Group Average__:
Cluster distance is defined by the __average__ distance between members.
- __Group Median__:
Like Group Average, only using the median.
_Divisive clustering_ algorithms are __top-down__ algorithms which build clusters by splitting larger clusters.
### Clustering in R
#### K-Means
The following code is an adaptation from [David Hitchcock](http://people.stat.sc.edu/Hitchcock/chapter6_R_examples.txt).
```{r kmeans, cache=TRUE}
k <- 2
kmeans.1 <- stats::kmeans(USArrests.1, centers = k)
head(kmeans.1$cluster) # cluster asignments
pairs(USArrests.1, panel=function(x,y) text(x,y,kmeans.1$cluster))
```
Things to note:
- The `stats::kmeans` function does the clustering.
- The cluster assignment is given in the `cluster` element of the `stats::kmeans` output.
- The visual inspection confirms that similar states have been assigned to the same cluster.
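Because the K-means solution depends on the random initialization, it is common to use several random restarts and keep the best one; a minimal sketch using the `nstart` argument of `stats::kmeans`:
```{r kmeans-nstart}
kmeans.2 <- stats::kmeans(USArrests.1, centers = 2, nstart = 20)  # 20 random restarts; the best is kept
kmeans.2$tot.withinss  # total within-cluster sum of squares of the retained solution
```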
#### K-Medoids
Start by growing a distance graph with `dist` and then partition using `pam`.
```{r kmedoids, cache=TRUE}
state.disimilarity <- dist(USArrests.1)
kmed.1 <- cluster::pam(x= state.disimilarity, k=2)
head(kmed.1$clustering)
plot(pca.1$x[,1], pca.1$x[,2], xlab="PC 1", ylab="PC 2", type ='n', lwd=2)
text(pca.1$x[,1], pca.1$x[,2], labels=rownames(USArrests.1), cex=0.7, lwd=2, col=kmed.1$cluster)
```
Things to note:
- K-medoids starts with the computation of a dissimilarity graph, done by the `dist` function.
- The clustering is done by the `cluster::pam` function.
- Inspecting the output confirms that similar states have been assigned to the same cluster.
- Many other similarity measures can be found in `proxy::dist()`.
- See `cluster::clara()` for a big-data implementation of PAM.
#### Hierarchical Clustering
We start with agglomerative clustering with single-linkage.
```{r HirarchialClustering}
hirar.1 <- stats::hclust(state.disimilarity, method='single')
plot(hirar.1, labels=rownames(USArrests.1), ylab="Distance")
```
Things to note:
- The clustering is done with the `hclust` function.
- We choose the single-linkage distance using the `method='single'` argument.
- We did not need to a-priori specify the number of clusters, $K$, since all the possible $K$'s are included in the output tree.
- The `plot` function has a particular method for `hclust` class objects, and plots them as dendrograms.
We now try other types of linkage, to verify that they indeed affect the clustering.
Starting with complete linkage.
```{r complete linkage}
hirar.2 <- hclust(state.disimilarity, method='complete')
plot(hirar.2, labels=rownames(USArrests.1), ylab="Distance")
```
Now with average linkage.
```{r average linkage}
hirar.3 <- hclust(state.disimilarity, method='average')
plot(hirar.3, labels=rownames(USArrests.1), ylab="Distance")
```
If we know how many clusters we want, we can use `stats::cutree` to get the class assignments.
```{r}
cut.2.2 <- cutree(hirar.2, k=2)
head(cut.2.2)
```
How to choose the number of clusters?
Just like the Scree Plot above, we can use a "knee heuristic".
Because the length of a tree's branch is proportional to distances, long branches imply inhomogeneous groups, while short branches imply homogeneous groups.
Here is a little simulation to demonstrate this:
```{r}
n.groups <- 3 # set the number of groups
data.p <- 10 # set the dimension of the data
data.n <- 100 # set the number of samples
# data with no separation between groups
the.data.10 <- mvtnorm::rmvnorm(n = data.n, mean = rep(0,data.p))
data.disimilarity.10 <- dist(the.data.10)
hirar.10 <- hclust(data.disimilarity.10, method = "complete")
plot(hirar.10, ylab="Distance", main='All from the same group')
# data with strong separation between groups
the.data.11 <-the.data.10+sample(c(0,10,20), data.n, replace=TRUE) # Shift each group
data.disimilarity.11 <- dist(the.data.11)
hirar.11 <- hclust(data.disimilarity.11, method = "complete")
plot(hirar.11, ylab="Distance", main=paste('Strong Separation Between',n.groups, 'Groups'))
```
## Bibliographic Notes
An excellent reference on multivariate analysis (exploratory and inference) is @izenman2008modern.
For some theory of PCA see my [Dimensionality Reduction Class Notes](https://github.com/johnros/dim_reduce/blob/master/dim_reduce.pdf) and references therein.
For a SUPERB, interactive, visual demonstration of dimensionality reduction, see [Christopher Olah's](http://colah.github.io/posts/2014-10-Visualizing-MNIST/) blog.
And also [Jay Alammar's](https://jalammar.github.io/illustrated-word2vec/).
For t-SNE see the creator's site: [Laurens van der Maaten](https://lvdmaaten.github.io/tsne/).
For an excellent book on kernel methods (RKHS) see @shawe2004kernel.
For more on everything, see @friedman2001elements.
For a softer introduction (to everything), see @james2013introduction.
## Practice Yourself
1. Generate multivariate Gaussian data with `mvtnorm::rmvnorm()`. Clearly this data has no structure at all: it is a $p$-dimensional shapeless cloud of $n$ points.
1. Now try various dimensionality reduction algorithms such as PCA, MDS, kPCA, sPCA. How does the sphere map to the plane? How does the mapping depend on $n$? And on $p$?
1. Map the data to a $p$-dimensional unit sphere by dividing each observation by its $l_2$ norm: `map2sphere <- function(x) x/sqrt(sum(x^2))`. Repeat the previous embeddings. Does this structureless data embed itself with structure?
1. Introduce artificial "structure" in the data and repeat the previous exercise. Use the Factor Analysis generative model in Eq. \@ref(eq:factor) to generate $p$ dimensional data along a one-dimensional line. Can you see that observations arrange themselves along a single line after your plane embedding?
1. Read about the Iris dataset using `?iris`. "Forget" the `Species` column to make the problem unsupervised.
1. Make pairs of scatter plots. Can you identify the clusters in the data?
1. Perform K-means with `centers=3`. To extract the clustering results (the cluster of each instance), use the `cluster` element of the `kmeans` output. Now recall the `Species` column to verify your clustering.
1. Perform hierarchical clustering with `hclust`, `method="single"` and `method="average"`. Extract the clustering results with `cutree`. Compare the accuracy of the two linkage methods.
1. Perform PCA on the data with `prcomp` function.
1. Print the Rotation matrix.
1. Print the PCA’s vectors with `pca$x`. These vectors are the new values for each instance in the dataset after the rotation.
1. Let's look at the first component (PC1) with `plot(pca$x[,1])` (i.e., reduce the dimensionality from 4 to 1 feature). Can you visually identify the three clusters (species)?
1. Determine the color of the points to be the true species with `col=iris$Species`.
See DataCamp's
[Unsupervised Learning in R](https://www.datacamp.com/courses/unsupervised-learning-in-r),
[Cluster Analysis in R](https://www.datacamp.com/courses/cluster-analysis-in-r),
[Dimensionality Reduction in R](https://www.datacamp.com/courses/dimensionality-reduction-in-r), and
[Advanced Dimensionality Reduction in R](https://www.datacamp.com/courses/advanced-dimensionality-reduction-in-r)
for more self practice.