diff --git a/doc/basic_usage.rst b/doc/basic_usage.rst
index fba7e5d7..05436966 100644
--- a/doc/basic_usage.rst
+++ b/doc/basic_usage.rst
@@ -142,7 +142,7 @@ can a dimension reduction technique like UMAP do for us?
 By reducing the dimension in a way that preserves as much of the structure
 of the data as possible we can get a visualisable representation of the
 data allowing us to "see" the data and its structure and begin to get some
-inuitions about the data itself.
+intuition about the data itself.

 To use UMAP for this task we need to first construct a UMAP object that
 will do the job for us. That is as simple as instantiating the class. So
@@ -198,7 +198,7 @@ the original).
 This does a useful job of capturing the structure of the data, and as
 can be seen from the matrix of scatterplots this is relatively
 accurate. Of course we learned at least this much just from that matrix of
-scatterplots -- which we could do since we only had four differnt
+scatterplots -- which we could do since we only had four different
 dimensions to analyse. If we had data with a larger number of dimensions
 the scatterplot matrix would quickly become unwieldy to plot, and far
 harder to interpret. So moving on from the Iris dataset, let's consider
@@ -362,7 +362,7 @@ of the reducer object, or call transform on the original data.
 We now have a dataset with 1797 rows (one for each hand-written digit
 sample), but only 2 columns. As with the Iris example we can now plot
 the resulting embedding, coloring the data points by the class that
-theyr belong to (i.e. the digit they represent).
+they belong to (i.e. the digit they represent).

 .. code:: python3

diff --git a/doc/clustering.rst b/doc/clustering.rst
index f3ae97ac..e0bbbca7 100644
--- a/doc/clustering.rst
+++ b/doc/clustering.rst
@@ -4,14 +4,14 @@ Using UMAP for Clustering
 UMAP can be used as an effective preprocessing step to boost the
 performance of density based clustering. This is somewhat controversial,
 and should be attempted with care. For a good discussion of some of the
-issues involved in this please see the various answers `in this
+issues involved in this, please see the various answers `in this
 stackoverflow thread `__
 on clustering the results of t-SNE. Many of the points of concern
 raised there are salient for clustering the results of UMAP. The
 most notable is that UMAP, like t-SNE, does not completely preserve
 density. UMAP,
-like t-SNE, can also create tears in clusters that are not actually
-present, resulting in a finer clustering than is necessarily present in
+like t-SNE, can also create false tears in clusters, resulting in a
+finer clustering than is necessarily present in
 the data. Despite these concerns there are still valid reasons to use
 UMAP as a preprocessing step for clustering. As with any clustering
 approach one will want to do some exploration and evaluation of the
@@ -136,7 +136,7 @@ of largely spherical clusters -- this is responsible for some of the
 sharp divides that K-Means puts across digit classes. We can potentially
 improve on this by using a smarter density based algorithm. In this case
 we've chosen to try HDBSCAN, which we believe to be among the most
-advanced density based tehcniques. For the sake of performance we'll
+advanced density based techniques. For the sake of performance we'll
 reduce the dimensionality of the data down to 50 dimensions via PCA
 (this recovers most of the variance), since HDBSCAN scales somewhat
 poorly with the dimensionality of the data it will work on.
diff --git a/doc/exploratory_analysis.rst b/doc/exploratory_analysis.rst
index f44c79eb..b0c8f7e9 100644
--- a/doc/exploratory_analysis.rst
+++ b/doc/exploratory_analysis.rst
@@ -18,7 +18,7 @@ exactly this, and the results are fascinating. While they may not actually tell
 anything new about number theory they do highlight interesting structures in prime
 factorizations, and demonstrate how UMAP can aid in interesting explorations of
 datasets that we might think we know well. It's worth visiting the linked article
-below as Dr. Williamson provides a rich and detiled exploration of UMAP as
+below as Dr. Williamson provides a rich and detailed exploration of UMAP as
 applied to prime factorizations of integers.

 .. image:: images/umap_primes.png
@@ -50,11 +50,11 @@ Language, Context, and Geometry in Neural Networks
 Among recent developments in natural language processing is the BERT neural network
 based technique for analysis of language. Among many things that BERT can do one is
 context sensitive embeddings of words -- providing numeric vector representations of words
-that are sentive to the context of how the word is used. Exactly what goes on inside
+that are sensitive to the context of how the word is used. Exactly what goes on inside
 the neural network to do this is a little mysterious (since the network is very complex
 with many many parameters). A tram of researchers from Google set out to explore the
 word embedding space generated by BERT, and among the tools used was UMAP. The linked
-blog post provides a detailed and inspirign analysis of what BERT's word embeddings
+blog post provides a detailed and inspiring analysis of what BERT's word embeddings
 look like, and how the different layers of BERT represent different aspects of language.

 .. image:: images/bert_embedding.png
@@ -91,7 +91,7 @@ gives you over 150,000 texts to consider. Since the texts are open you can actua
 the text content involved. With some NLP and neural network wizardry David McClure build a
 network of such texts and then used node2vec and UMAP to generate a map of them. The result
 is a galaxy of textbooks showing inter-relationships between subjects, similar and related texts,
-and genrally just a an interesting ladscape of science to be explored. As with some
+and generally just an interesting landscape of science to be explored. As with some
 of the other projects here David made a great interactive viewer allowing for rich
 exploration of the results.

diff --git a/doc/supervised.rst b/doc/supervised.rst
index d91eaf51..0b670100 100644
--- a/doc/supervised.rst
+++ b/doc/supervised.rst
@@ -24,7 +24,7 @@ seaborn for plotting.

 Our example dataset for this exploration will be the `Fashion-MNIST
 dataset from Zalando Research `__. It is
-desgined to be a drop-in replacement for the classic MNIST digits
+designed to be a drop-in replacement for the classic MNIST digits
 dataset, but uses images of fashion items (dresses, coats, shoes, bags,
 etc.) instead of handwritten digits. Since the images are more complex
 it provides a greater challenge than MNIST digits. We can load it in
@@ -86,7 +86,7 @@ a scatterplot.
 That took a little time, but not all that long considering it is 70,000
 data points in 784 dimensional space. We can simply plot the results as
 a scatterplot, colored by the class of the fashion item. We can use
-matplotlibs colorbar with suitable tick-labels to give us the color key.
+matplotlib's colorbar with suitable tick-labels to give us the color key.

 .. code:: python3

@@ -109,7 +109,7 @@ separate quite so cleanly. In particular T-shirts, shirts, dresses,
 pullovers, and coats are all a little mixed. At the very least the
 dresses are largely separated, and the T-shirts are mostly in one large
 clump, but they are not well distinguished from the others. Worse still
-are the coats, shirts, and pullovers (somewhat unsruprisingly as these
+are the coats, shirts, and pullovers (somewhat unsurprisingly as these
 can certainly look very similar) which all have significant overlap with
 one another. Ideally we would like much better class separation. Since
 we have the label information we can actually give that to UMAP to use!
@@ -169,7 +169,7 @@ distinct banding pattern that was visible in the original unsupervised
 case; the pants, t-shirts and bags both retained their shape and
 internal structure; etc. The second point to note is that we have also
 retained the global structure. While the individual classes have been
-cleanly seprated from one another, the inter-relationships among the
+cleanly separated from one another, the inter-relationships among the
 classes have been preserved: footwear classes are all near one another;
 trousers and bags are at opposite sides of the plot; and the arc of
 pullover, shirts, t-shirts and dresses is still in place.
@@ -177,7 +177,7 @@ pullover, shirts, t-shirts and dresses is still in place.
 The key point is this: the important structural properties of the data
 have been retained while the known classes have been cleanly pulled
 apart and isolated. If you have data with known classes and want to
-seprate them while still having a meaningful embedding of individual
+separate them while still having a meaningful embedding of individual
 points then supervised UMAP can provide exactly what you need.

 Using Partial Labelling (Semi-Supervised UMAP)
@@ -198,7 +198,7 @@ the noise points from a DBSCAN clustering).

 Now that we have randomly masked some of the labels we can try to
 perform supervised learning again. Everything works as before, but UMAP
-will interpret the -1 label as beingan unlabelled point and learn
+will interpret the -1 label as being an unlabelled point and learn
 accordingly.

 .. code:: python3
@@ -338,7 +338,7 @@ including much of the internal structure of the classes. For the most
 part assignment of new points follows the classes well. The greatest
 source of confusion in some t-shirts that ended up in mixed with the
 shirts, and some pullovers which are confused with the coats. Given the
-difficulty of the problemn this is a good result, particularly when
+difficulty of the problem this is a good result, particularly when
 compared with current state-of-the-art approaches such as `siamese and
 triplet networks `__.