Merge remote-tracking branch 'origin/master'
lmcinnes committed Apr 29, 2020
2 parents 0954917 + fdd1096 commit 1197625
Showing 5 changed files with 19 additions and 19 deletions.
6 changes: 3 additions & 3 deletions doc/basic_usage.rst
@@ -142,7 +142,7 @@ can a dimension reduction technique like UMAP do for us? By reducing the
dimension in a way that preserves as much of the structure of the data
as possible we can get a visualisable representation of the data
allowing us to "see" the data and its structure and begin to get some
-inuitions about the data itself.
+intuition about the data itself.

To use UMAP for this task we need to first construct a UMAP object that
will do the job for us. That is as simple as instantiating the class. So
@@ -198,7 +198,7 @@ the original).
This does a useful job of capturing the structure of the data, and as
can be seen from the matrix of scatterplots this is relatively accurate.
Of course we learned at least this much just from that matrix of
-scatterplots -- which we could do since we only had four differnt
+scatterplots -- which we could do since we only had four different
dimensions to analyse. If we had data with a larger number of dimensions
the scatterplot matrix would quickly become unwieldy to plot, and far
harder to interpret. So moving on from the Iris dataset, let's consider
@@ -362,7 +362,7 @@ of the reducer object, or call transform on the original data.
We now have a dataset with 1797 rows (one for each hand-written digit
sample), but only 2 columns. As with the Iris example we can now plot
the resulting embedding, coloring the data points by the class that
-theyr belong to (i.e. the digit they represent).
+they belong to (i.e. the digit they represent).

.. code:: python3
8 changes: 4 additions & 4 deletions doc/clustering.rst
@@ -4,14 +4,14 @@ Using UMAP for Clustering
UMAP can be used as an effective preprocessing step to boost the
performance of density based clustering. This is somewhat controversial,
and should be attempted with care. For a good discussion of some of the
-issues involved in this please see the various answers `in this
+issues involved in this, please see the various answers `in this
stackoverflow
thread <https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne>`__
on clustering the results of t-SNE. Many of the points of concern raised
there are salient for clustering the results of UMAP. The most notable
is that UMAP, like t-SNE, does not completely preserve density. UMAP,
-like t-SNE, can also create tears in clusters that are not actually
-present, resulting in a finer clustering than is necessarily present in
+like t-SNE, can also create false tears in clusters, resulting in a
+finer clustering than is necessarily present in
the data. Despite these concerns there are still valid reasons to use
UMAP as a preprocessing step for clustering. As with any clustering
approach one will want to do some exploration and evaluation of the
@@ -136,7 +136,7 @@ of largely spherical clusters -- this is responsible for some of the
sharp divides that K-Means puts across digit classes. We can potentially
improve on this by using a smarter density based algorithm. In this case
we've chosen to try HDBSCAN, which we believe to be among the most
-advanced density based tehcniques. For the sake of performance we'll
+advanced density based techniques. For the sake of performance we'll
reduce the dimensionality of the data down to 50 dimensions via PCA
(this recovers most of the variance), since HDBSCAN scales somewhat
poorly with the dimensionality of the data it will work on.
8 changes: 4 additions & 4 deletions doc/exploratory_analysis.rst
@@ -18,7 +18,7 @@ exactly this, and the results are fascinating. While they may not actually tell
anything new about number theory they do highlight interesting structures
in prime factorizations, and demonstrate how UMAP can aid in interesting explorations
of datasets that we might think we know well. It's worth visiting the linked article
-below as Dr. Williamson provides a rich and detiled exploration of UMAP as
+below as Dr. Williamson provides a rich and detailed exploration of UMAP as
applied to prime factorizations of integers.

.. image:: images/umap_primes.png
@@ -50,11 +50,11 @@ Language, Context, and Geometry in Neural Networks
Among recent developments in natural language processing is the BERT neural network
based technique for analysis of language. Among many things that BERT can do one is
context sensitive embeddings of words -- providing numeric vector representations of words
-that are sentive to the context of how the word is used. Exactly what goes on inside
+that are sensitive to the context of how the word is used. Exactly what goes on inside
the neural network to do this is a little mysterious (since the network is very complex
with many many parameters). A tram of researchers from Google set out to explore the
word embedding space generated by BERT, and among the tools used was UMAP. The linked
-blog post provides a detailed and inspirign analysis of what BERT's word embeddings
+blog post provides a detailed and inspiring analysis of what BERT's word embeddings
look like, and how the different layers of BERT represent different aspects of language.

.. image:: images/bert_embedding.png
@@ -91,7 +91,7 @@ gives you over 150,000 texts to consider. Since the texts are open you can actua
the text content involved. With some NLP and neural network wizardry David McClure build
a network of such texts and then used node2vec and UMAP to generate a map of them. The result
is a galaxy of textbooks showing inter-relationships between subjects, similar and related texts,
-and genrally just a an interesting ladscape of science to be explored. As with some
+and generally just a an interesting ladscape of science to be explored. As with some
of the other projects here David made a great interactive viewer allowing for rich exploration
of the results.

2 changes: 1 addition & 1 deletion doc/how_umap_works.rst
@@ -97,7 +97,7 @@ intersections of sets. The key is that the background topological theory
actually provides guarantees about how well this simple process can
produce something that represents the topological space itself in a
meaningful way (the `Nerve
-theorem <https://en.wikipedia.org/wiki/Nerve_theorem>`__ is the relevant
+theorem <https://en.wikipedia.org/wiki/Nerve_of_a_covering>`__ is the relevant
result for those interested). Obviously the quality of the cover is
important, and finer covers provide more accuracy, but the reality is
that despite its simplicity the process captures much of the topology.
14 changes: 7 additions & 7 deletions doc/supervised.rst
@@ -24,7 +24,7 @@ seaborn for plotting.
Our example dataset for this exploration will be the `Fashion-MNIST
dataset from Zalando
Research <https://github.com/zalandoresearch/fashion-mnist>`__. It is
-desgined to be a drop-in replacement for the classic MNIST digits
+designed to be a drop-in replacement for the classic MNIST digits
dataset, but uses images of fashion items (dresses, coats, shoes, bags,
etc.) instead of handwritten digits. Since the images are more complex
it provides a greater challenge than MNIST digits. We can load it in
@@ -86,7 +86,7 @@ a scatterplot.
That took a little time, but not all that long considering it is 70,000
data points in 784 dimensional space. We can simply plot the results as
a scatterplot, colored by the class of the fashion item. We can use
-matplotlibs colorbar with suitable tick-labels to give us the color key.
+matplotlib's colorbar with suitable tick-labels to give us the color key.

.. code:: python3
@@ -109,7 +109,7 @@ separate quite so cleanly. In particular T-shirts, shirts, dresses,
pullovers, and coats are all a little mixed. At the very least the
dresses are largely separated, and the T-shirts are mostly in one large
clump, but they are not well distinguished from the others. Worse still
-are the coats, shirts, and pullovers (somewhat unsruprisingly as these
+are the coats, shirts, and pullovers (somewhat unsurprisingly as these
can certainly look very similar) which all have significant overlap with
one another. Ideally we would like much better class separation. Since
we have the label information we can actually give that to UMAP to use!
@@ -169,15 +169,15 @@ distinct banding pattern that was visible in the original unsupervised
case; the pants, t-shirts and bags both retained their shape and
internal structure; etc. The second point to note is that we have also
retained the global structure. While the individual classes have been
-cleanly seprated from one another, the inter-relationships among the
+cleanly separated from one another, the inter-relationships among the
classes have been preserved: footwear classes are all near one another;
trousers and bags are at opposite sides of the plot; and the arc of
pullover, shirts, t-shirts and dresses is still in place.

The key point is this: the important structural properties of the data
have been retained while the known classes have been cleanly pulled
apart and isolated. If you have data with known classes and want to
-seprate them while still having a meaningful embedding of individual
+separate them while still having a meaningful embedding of individual
points then supervised UMAP can provide exactly what you need.

Using Partial Labelling (Semi-Supervised UMAP)
@@ -198,7 +198,7 @@ the noise points from a DBSCAN clustering).
Now that we have randomly masked some of the labels we can try to
perform supervised learning again. Everything works as before, but UMAP
-will interpret the -1 label as beingan unlabelled point and learn
+will interpret the -1 label as being an unlabelled point and learn
accordingly.

.. code:: python3
@@ -338,7 +338,7 @@ including much of the internal structure of the classes. For the most
part assignment of new points follows the classes well. The greatest
source of confusion in some t-shirts that ended up in mixed with the
shirts, and some pullovers which are confused with the coats. Given the
-difficulty of the problemn this is a good result, particularly when
+difficulty of the problem this is a good result, particularly when
compared with current state-of-the-art approaches such as `siamese and
triplet
networks <https://github.com/adambielski/siamese-triplet/blob/master/Experiments_FashionMNIST.ipynb>`__.
