Replies: 2 comments 2 replies
-
This seems like a specific case of a more general problem where we want some level of fuzzy matching between the datasets. In this case we have a third variable (country, in the Netflix/Amazon example) that provides a groupby-style matching. This is certainly something that could be done, but it isn't implemented in the code as it stands. At the very least, it is likely possible to hack the AlignedUMAP code itself to do something like what you want now, presuming you need it sooner rather than later and are willing to do some work to get it. One caveat: all manner of small, subtle problems inevitably arise when doing this in practice, so it could end up getting messy.

The core piece here is the functions at lines 695 to 929 in c55cc36. Those functions effectively just iterate through the different UMAP embeddings to perform (using […] It seems that what you would want to do is replace that with something that loops over (a subsample!) of identified points (i.e. a sampling of points in the other dataset that have the same country) and performs the same gradient update. Now, the gradient update (lines 756-765, for example) has some extra terms that you probably won't need to worry about if you only have two datasets; specifically […]

The next catch is that all of this is happening in a tight loop deep in the code, so you want it all to happen as efficiently as possible (hence, for example, subsampling the points that match on country; probably only a few will suffice, provided you use a different random sample each time). That is why the lookup of a matching identified point goes through a 3-dimensional array rather than the dict that was originally passed in. You'll need to work out what data to pass in, ideally in vector format so that numba can make efficient use of it, to achieve your random sampling according to country, or whatever identifier you use.

I will stop here and see whether all of that seems like the sort of effort you would be willing to put in. If so, we can discuss more details from there. This ultimately means getting your hands pretty messy mucking with code, but it should be doable, depending on how much you are willing to build a custom UMAP embedder out of the primitives in the existing codebase and by modifying what is already there.
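To make the "vector format" point a little more concrete, here is a minimal sketch of one possible data layout. It is not taken from the AlignedUMAP code: the function names `build_cross_index` and `sample_partners` are hypothetical, and the gradient update itself is left out. The idea is to encode countries as integers and store the members of each country in the other dataset as one flat index array plus an offsets array, so that a random same-country partner can be drawn with plain array indexing (the kind of thing numba generally handles well).

```python
import numpy as np


def build_cross_index(labels_a, labels_b):
    """Encode group labels so points in A can look up same-group points in B.

    Returns (codes_a, order_b, offsets_b). The members of group g in B are
    order_b[offsets_b[g]:offsets_b[g + 1]], and codes_a[i] is the group of
    point i in A. All names here are illustrative, not part of umap-learn.
    """
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    # Shared, sorted vocabulary of group labels across both datasets.
    groups = np.unique(np.concatenate([labels_a, labels_b]))
    codes_a = np.searchsorted(groups, labels_a)
    codes_b = np.searchsorted(groups, labels_b)
    # Indices of B sorted by group, plus the start offset of each group.
    order_b = np.argsort(codes_b, kind="stable").astype(np.int64)
    offsets_b = np.zeros(len(groups) + 1, dtype=np.int64)
    offsets_b[1:] = np.cumsum(np.bincount(codes_b, minlength=len(groups)))
    return codes_a, order_b, offsets_b


def sample_partners(i, codes_a, order_b, offsets_b, n_samples, rng):
    """Draw n_samples random points of B (with replacement) sharing i's group."""
    start = offsets_b[codes_a[i]]
    stop = offsets_b[codes_a[i] + 1]
    if start == stop:
        # No point in B shares this group, so there is nothing to align to.
        return np.empty(0, dtype=np.int64)
    return order_b[rng.integers(start, stop, size=n_samples)]
```

Inside the optimization loop you would call something like `sample_partners` once per point and per epoch, so that a fresh random subsample is used each time, and apply the alignment gradient update towards each sampled partner.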
-
Hi again, unfortunately I haven't worked on this, as I was immersed in deadlines and other tasks. However, I came across a number of works on optimal transport (a subject I didn't know anything about… as usual), and I believe my first question was in some sense an OT problem. I was looking at the Python OT library, and in particular at the applications of Fused Gromov-Wasserstein (FGW) described in this paper. Do you think it could be a good idea to use FGW (or OT in general) on the graphs underlying multiple UMAP models to initially match the data? Also, could barycenters be used for the same purpose as AlignedUMAP, that is, to obtain a joint embedding over multiple sources?
-
[disclaimer: I posted this in the issue tracker but it is more appropriate for discussion]
I read with great interest about the new functionality implemented in umap-learn, both the operations on different models and the mapper integration for AlignedUMAP.
As far as I understand, the operations are supported only for different models that share the same data (1:1 correspondence between points). AlignedUMAP, instead, can work provided that some points are shared across different models.
I wonder if it is possible to align multiple models when I don't have a strict notion of correspondence among data but only a loose one.
I'll give an example to clarify this: suppose I collect n features (x1, x2, …, xn) on individuals sampled from different populations (say A, B, C) and fit a UMAP model U to them. Now I collect a set of m different (but somehow related) features (t1, t2, …, tm) from other individuals sampled from the same populations (A, B, C) and fit another model V. Suppose I don't know whether an individual has been sampled twice and characterized for both feature sets; I only know each individual's reference population. Also, let's assume that the features […] Is there a way to align U and V?
I work in the field of molecular biology, but to generalize the example: anonymously collect Netflix and Amazon user profiles by country, fit two models, and align the models by enforcing that users are loosely matched by their country (country is not a feature to be included in the model).
One way to go might be to sample a fixed number of points and assign random associations within populations, fit the AlignedUMAP, repeat a sufficient number of times, and finally extract the consensus among the aligned models, but I guess there may be more elegant ways to go here.
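The resampling idea could be sketched roughly as follows. This is only an illustration under my own assumptions: the helper name `random_population_relation` is made up, and the consensus step is left open. It builds one random within-population pairing as a dict mapping row indices of the first dataset to row indices of the second, which is the form of relation that AlignedUMAP accepts.

```python
import numpy as np


def random_population_relation(pops_u, pops_v, n_per_pop, rng):
    """Randomly pair points of U and V that belong to the same population.

    pops_u, pops_v: population label of each row in the two datasets.
    Returns a dict {row index in U: row index in V} with at most n_per_pop
    pairs per population. Hypothetical helper, not part of umap-learn.
    """
    pops_u = np.asarray(pops_u)
    pops_v = np.asarray(pops_v)
    relation = {}
    # Only populations present in both datasets can be paired.
    for pop in np.intersect1d(pops_u, pops_v):
        idx_u = np.flatnonzero(pops_u == pop)
        idx_v = np.flatnonzero(pops_v == pop)
        k = min(n_per_pop, len(idx_u), len(idx_v))
        # Sample without replacement so each point is used at most once.
        chosen_u = rng.choice(idx_u, size=k, replace=False)
        chosen_v = rng.choice(idx_v, size=k, replace=False)
        relation.update(zip(chosen_u.tolist(), chosen_v.tolist()))
    return relation


# Intended use (not run here; X_u, X_v are the two feature matrices):
# for seed in range(n_repeats):
#     rng = np.random.default_rng(seed)
#     relation = random_population_relation(pops_u, pops_v, 50, rng)
#     mapper = umap.AlignedUMAP().fit([X_u, X_v], relations=[relation])
#     ...collect mapper.embeddings_ and combine them into a consensus
```

Each repeat uses a different random pairing, so no single arbitrary association dominates; how to combine the resulting embeddings into a consensus is the part I am unsure about.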