AlignedUMAP on (partially) disjoint datasets #596

dawe · 2021-02-18T16:33:16Z

dawe
Feb 18, 2021

[disclaimer: I posted this in the issue tracker but it is more appropriate for discussion]

I read with much interest the new functionalities implemented in umap-learn, both the operations on different models and the mapper integration for AlignedUMAP.
As far as I understand, the operations are supported only for different models that share the same data (1:1 correspondence between points). AlignedUMAP, instead, can work provided that some points are shared across different models.
I wonder if it is possible to align multiple models when I don't have a strict notion of correspondence among data but only a loose one.
Here is an example to clarify: suppose I collect n features (x1, x2, …, xn) on individuals sampled from different populations (say A, B, C) and fit a UMAP model U to them. Now I collect a set of m different (but somehow related) features (t1, t2, …, tm) from other individuals sampled from the same populations (A, B, C) and fit another model V. Suppose I don't know whether an individual has been sampled twice and characterized for both feature sets; I only know their reference populations. Also, let's assume that the features… Is there a way to align U and V?
I work in the field of molecular biology, but to generalize the example: anonymously collect Netflix and Amazon user profiles by country, fit two models, and align them, enforcing that users are loosely matched by country (country is not a feature included in the model).
One way to go, maybe, could be to sample a fixed number of points and assign random association within populations, fit the AlignedUMAP, repeat a sufficient number of times, finally extract the consensus among aligned models, but I guess there may be more elegant ways to go here.
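
The random-association idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `random_group_relation` and its parameters are hypothetical names, not part of umap-learn, and the only thing taken from the library is the convention that a relation is a dict mapping row indices of one dataset to row indices of the next.

```python
import random
from collections import defaultdict


def random_group_relation(labels_u, labels_v, n_pairs_per_group, seed=None):
    """Randomly pair rows of dataset U with rows of dataset V that share a group label.

    Hypothetical helper: returns a dict {row index in U: row index in V}.
    """
    rng = random.Random(seed)

    # Collect the row indices belonging to each group in both datasets.
    by_group_u, by_group_v = defaultdict(list), defaultdict(list)
    for i, g in enumerate(labels_u):
        by_group_u[g].append(i)
    for j, g in enumerate(labels_v):
        by_group_v[g].append(j)

    relation = {}
    # Only groups present in both datasets can contribute pairs.
    for g in sorted(by_group_u.keys() & by_group_v.keys()):
        k = min(n_pairs_per_group, len(by_group_u[g]), len(by_group_v[g]))
        # Sample without replacement so each row is used at most once.
        us = rng.sample(by_group_u[g], k)
        vs = rng.sample(by_group_v[g], k)
        relation.update(zip(us, vs))
    return relation
```

Each resulting dict would then be passed as the single entry of the `relations` list when fitting `umap.AlignedUMAP` on the two feature matrices, and the whole procedure repeated with different seeds; how to extract a consensus from the repeated fits is left open here.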

lmcinnes · 2021-02-18T17:57:10Z

lmcinnes
Feb 18, 2021
Maintainer

This seems like a specific case of a more general problem where we want some level of fuzzy matching between the datasets -- in this case we have a third variable (country in the Netflix Amazon example) that provides a groupby style matching. This is certainly something that could be done, but isn't implemented in the code as it stands. At the very least it is likely possible to hack the AlignedUMAP code itself to do something like what you want now, presuming you need it sooner rather than later and are willing to do some work to get it. A caveat -- inevitably there are all manner of small subtle problems that arise when doing this in practice, so it could end up getting messy.

The core pieces here are the functions _optimize_layout_aligned_euclidean_single_epoch and optimize_layout_aligned_euclidean (see

umap/umap/layouts.py

Lines 695 to 929 in c55cc36

    
    def _optimize_layout_aligned_euclidean_single_epoch(
        head_embeddings,
        tail_embeddings,
        heads,
        tails,
        epochs_per_sample,
        a,
        b,
        regularisation_weights,
        relations,
        rng_state,
        gamma,
        lambda_,
        dim,
        move_other,
        alpha,
        epochs_per_negative_sample,
        epoch_of_next_negative_sample,
        epoch_of_next_sample,
        n,
    ):
        n_embeddings = len(heads)
        window_size = (relations.shape[1] - 1) // 2
        max_n_edges = 0
        for e_p_s in epochs_per_sample:
            if e_p_s.shape[0] >= max_n_edges:
                max_n_edges = e_p_s.shape[0]
        embedding_order = np.arange(n_embeddings).astype(np.int32)
        np.random.shuffle(embedding_order)
        for i in range(max_n_edges):
            for m in embedding_order:
                if i < epoch_of_next_sample[m].shape[0] and epoch_of_next_sample[m][i] <= n:
                    j = heads[m][i]
                    k = tails[m][i]
                    current = head_embeddings[m][j]
                    other = tail_embeddings[m][k]
                    dist_squared = rdist(current, other)
                    if dist_squared > 0.0:
                        grad_coeff = -2.0 * a * b * pow(dist_squared, b - 1.0)
                        grad_coeff /= a * pow(dist_squared, b) + 1.0
                    else:
                        grad_coeff = 0.0
                    for d in range(dim):
                        grad_d = clip(grad_coeff * (current[d] - other[d]))
                        for offset in range(-window_size, window_size):
                            neighbor_m = m + offset
                            if (
                                neighbor_m >= 0
                                and neighbor_m < n_embeddings
                                and offset != 0
                            ):
                                identified_index = relations[m, offset + window_size, j]
                                if identified_index >= 0:
                                    grad_d -= clip(
                                        (lambda_ * np.exp(-(np.abs(offset) - 1)))
                                        * regularisation_weights[m, offset + window_size, j]
                                        * (
                                            current[d]
                                            - head_embeddings[neighbor_m][
                                                identified_index, d
                                            ]
                                        )
                                    )
                        current[d] += clip(grad_d) * alpha
                        if move_other:
                            other_grad_d = clip(grad_coeff * (other[d] - current[d]))
                            for offset in range(-window_size, window_size):
                                neighbor_m = m + offset
                                if (
                                    neighbor_m >= 0
                                    and neighbor_m < n_embeddings
                                    and offset != 0
                                ):
                                    identified_index = relations[m, offset + window_size, k]
                                    if identified_index >= 0:
                                        grad_d -= clip(
                                            (lambda_ * np.exp(-(np.abs(offset) - 1)))
                                            * regularisation_weights[
                                                m, offset + window_size, k
                                            ]
                                            * (
                                                other[d]
                                                - head_embeddings[neighbor_m][
                                                    identified_index, d
                                                ]
                                            )
                                        )
                            other[d] += clip(other_grad_d) * alpha
                    epoch_of_next_sample[m][i] += epochs_per_sample[m][i]
                    if epochs_per_negative_sample[m][i] > 0:
                        n_neg_samples = int(
                            (n - epoch_of_next_negative_sample[m][i])
                            / epochs_per_negative_sample[m][i]
                        )
                    else:
                        n_neg_samples = 0
                    for p in range(n_neg_samples):
                        k = tau_rand_int(rng_state) % tail_embeddings[m].shape[0]
                        other = tail_embeddings[m][k]
                        dist_squared = rdist(current, other)
                        if dist_squared > 0.0:
                            grad_coeff = 2.0 * gamma * b
                            grad_coeff /= (0.001 + dist_squared) * (
                                a * pow(dist_squared, b) + 1
                            )
                        elif j == k:
                            continue
                        else:
                            grad_coeff = 0.0
                        for d in range(dim):
                            if grad_coeff > 0.0:
                                grad_d = clip(grad_coeff * (current[d] - other[d]))
                            else:
                                grad_d = 4.0
                            for offset in range(-window_size, window_size):
                                neighbor_m = m + offset
                                if (
                                    neighbor_m >= 0
                                    and neighbor_m < n_embeddings
                                    and offset != 0
                                ):
                                    identified_index = relations[m, offset + window_size, j]
                                    if identified_index >= 0:
                                        grad_d -= clip(
                                            (lambda_ * np.exp(-(np.abs(offset) - 1)))
                                            * regularisation_weights[
                                                m, offset + window_size, j
                                            ]
                                            * (
                                                current[d]
                                                - head_embeddings[neighbor_m][
                                                    identified_index, d
                                                ]
                                            )
                                        )
                            current[d] += clip(grad_d) * alpha
                    epoch_of_next_negative_sample[m][i] += (
                        n_neg_samples * epochs_per_negative_sample[m][i]
                    )


    def optimize_layout_aligned_euclidean(
        head_embeddings,
        tail_embeddings,
        heads,
        tails,
        n_epochs,
        epochs_per_sample,
        regularisation_weights,
        relations,
        rng_state,
        a=1.576943460405378,
        b=0.8950608781227859,
        gamma=1.0,
        lambda_=5e-3,
        initial_alpha=1.0,
        negative_sample_rate=5.0,
        parallel=True,
        verbose=False,
    ):
        dim = head_embeddings[0].shape[1]
        move_other = head_embeddings[0].shape[0] == tail_embeddings[0].shape[0]
        alpha = initial_alpha
        epochs_per_negative_sample = numba.typed.List.empty_list(numba.types.float32[::1])
        epoch_of_next_negative_sample = numba.typed.List.empty_list(
            numba.types.float32[::1]
        )
        epoch_of_next_sample = numba.typed.List.empty_list(numba.types.float32[::1])
        for m in range(len(heads)):
            epochs_per_negative_sample.append(
                epochs_per_sample[m].astype(np.float32) / negative_sample_rate
            )
            epoch_of_next_negative_sample.append(
                epochs_per_negative_sample[m].astype(np.float32)
            )
            epoch_of_next_sample.append(epochs_per_sample[m].astype(np.float32))
        optimize_fn = numba.njit(
            _optimize_layout_aligned_euclidean_single_epoch,
            fastmath=True,
            parallel=parallel,
        )
        for n in range(n_epochs):
            optimize_fn(
                head_embeddings,
                tail_embeddings,
                heads,
                tails,
                epochs_per_sample,
                a,
                b,
                regularisation_weights,
                relations,
                rng_state,
                gamma,
                lambda_,
                dim,
                move_other,
                alpha,
                epochs_per_negative_sample,
                epoch_of_next_negative_sample,
                epoch_of_next_sample,
                n,
            )
            alpha = initial_alpha * (1.0 - (float(n) / float(n_epochs)))
            if verbose and n % int(n_epochs / 10) == 0:
                print("\tcompleted ", n, " / ", n_epochs, "epochs")
        return head_embeddings

). The rest of the AlignedUMAP code is essentially machinations to clean up and set up the data passed in so that it can be fed to these functions. Those preparations will likely also need to be altered, but that might be easier to work out; we'll leave that 'til later.

Those functions effectively just iterate through the different UMAP embeddings (using m as the indexing variable into the different embeddings) and run a UMAP optimization step for each one. The real trick is the added code such as lines 747-765; you'll see similar blocks elsewhere in _optimize_layout_aligned_euclidean_single_epoch. Now, presuming you are aligning only 2 datasets (to make things easier), you can ignore the loop over windows -- we just want to look at the dataset other than the one we are currently processing (the windowing looks at nearby datasets in a linear order of datasets). We are then going to find an identified point in the other dataset as per a relation between the datasets, and adjust our gradient according to the distance between our current point and the identified point.
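
As a rough illustration of the alignment term described above, here is a minimal sketch for the two-dataset case, written in plain Python rather than numba. The name `aligned_step` and its arguments are hypothetical and not part of umap-learn. The local `clip` is assumed to mirror the clamp to [-4, 4] used in umap/layouts.py, and for adjacent datasets (|offset| = 1) the factor np.exp(-(np.abs(offset) - 1)) equals 1, so it is omitted.

```python
import numpy as np


def clip(val, limit=4.0):
    """Clamp a gradient component to [-limit, limit]."""
    return max(-limit, min(limit, val))


def aligned_step(current, identified, grad, alpha=1.0, lambda_=5e-3, weight=1.0):
    """Apply one update to `current`, pulling it toward its identified point.

    current    -- coordinates of the point in the dataset being optimized
    identified -- coordinates of the related point in the other dataset
    grad       -- the ordinary (already clipped) UMAP gradient for this edge
    weight     -- stands in for the regularisation_weights entry
    """
    new = np.array(current, dtype=float)
    for d in range(new.shape[0]):
        # Subtract a term proportional to the offset from the identified point.
        grad_d = grad[d] - clip(lambda_ * weight * (current[d] - identified[d]))
        new[d] += clip(grad_d) * alpha
    return new
```

With a zero UMAP gradient, the step moves the point a fraction `lambda_ * weight * alpha` of the way toward its identified partner, which is the behaviour that keeps related points close across embeddings.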

It seems that what you would want to do is replace that with something that loops over (a subsample of!) the identified points (i.e. a sampling of points in the other dataset that have the same country) and performs the same gradient update. Now the gradient update (lines 756-765 for example) has some extra terms that you probably won't need to worry about if you only have two datasets; specifically np.exp(-(np.abs(offset) - 1)), which handles windowing out to look at nearby datasets, and regularisation_weights[m, offset + window_size, j], which is a term that considers how similar the neighborhood structure around identified points is, but which doesn't really make sense for your matching-by-country example. So -- replace that code with a suitable loop and you'll be okay.
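
One way the suggested replacement loop might look is sketched below, again in plain Python for readability. Everything here is an assumption made for illustration: the name `group_alignment_term`, the `candidates` array of same-country row indices, and in particular the choice to average the pulls over the sampled points, which the thread does not specify. The windowing factor and regularisation weight are dropped, as suggested above for two datasets.

```python
import numpy as np


def clip(val, limit=4.0):
    """Clamp a gradient component to [-limit, limit]."""
    return max(-limit, min(limit, val))


def group_alignment_term(point, other_embedding, candidates, rng,
                         lambda_=5e-3, n_samples=3):
    """Average clipped pull of `point` toward random same-group points.

    point           -- coordinates of the current point, shape (dim,)
    other_embedding -- embedding of the other dataset, shape (n_other, dim)
    candidates      -- row indices in other_embedding sharing the point's group
    rng             -- a numpy random Generator
    The returned vector is meant to be subtracted from grad_d per dimension.
    """
    term = np.zeros(point.shape[0], dtype=float)
    if len(candidates) == 0:
        return term  # no same-group points: no alignment pull
    size = min(n_samples, len(candidates))
    picks = rng.choice(candidates, size=size, replace=False)
    for idx in picks:
        for d in range(point.shape[0]):
            term[d] += clip(lambda_ * (point[d] - other_embedding[idx, d]))
    return term / size
```

Drawing a fresh subsample on every call means that, over many epochs, the point is pulled toward the region occupied by its group in the other embedding rather than toward any single partner.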

The next catch is that all of this is happening in a tight loop deep in the code, so you want it all to happen as efficiently as possible (hence, for example, subsampling points that match on country; probably only a few will suffice provided you use a different random sample each time). That's why you'll note that the relation to find a matching identified point is looked up via a 3-dimensional array and not via a dict as was originally passed in. You'll need to work out what data to pass in, ideally in vector format so that numba can make efficient use of it, to achieve your random sampling according to country, or whatever identifier you use.
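
One possible vector format for the per-country lookup is a CSR-style pair of flat integer arrays, sketched below. This layout is a suggestion made here, not something prescribed in the thread or implemented in umap-learn, and the names `build_group_index` and `sample_from_group` are hypothetical. Group labels are assumed to be integer codes in the range 0 to n_groups - 1.

```python
import numpy as np


def build_group_index(labels_other, n_groups):
    """Return (indptr, indices) describing group membership as flat arrays.

    indices[indptr[g]:indptr[g + 1]] holds the row indices of the other
    dataset that belong to group g.
    """
    labels_other = np.asarray(labels_other, dtype=np.int64)
    # A stable sort keeps rows of each group in their original order.
    indices = np.argsort(labels_other, kind="stable").astype(np.int32)
    counts = np.bincount(labels_other, minlength=n_groups)
    indptr = np.concatenate(([0], np.cumsum(counts))).astype(np.int32)
    return indptr, indices


def sample_from_group(indptr, indices, group, draw):
    """Map an integer random draw to one member of `group` (-1 if empty)."""
    start, end = indptr[group], indptr[group + 1]
    if end == start:
        return -1
    return indices[start + draw % (end - start)]
```

Inside the optimization loop, `draw` could come from the same `tau_rand_int(rng_state)` generator already used for negative sampling, and the -1 sentinel matches how the existing code treats `identified_index < 0` as "no related point".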

I will stop here, and see if all of that seems like the sort of effort you would be willing to put in. If so we can discuss more details from there. This is ultimately getting your hands pretty messy mucking with code, but it should be doable depending on how much you are willing to build a custom UMAP embedder out of the primitives in the existing codebase, and by modifying what is already there.

1 reply

dawe Feb 18, 2021
Author

Wow, I'm speechless, thank you for sharing this!
I do indeed have two datasets whose embeddings you can align by eye (BTW, this is the first time in forever that this has happened).
I will try to work on this, I hope I can get back to you soon.

[EDIT]
TBH I'm not even sure if the random subsampling is a reasonable way to go.

dawe · 2021-08-18T12:29:27Z

dawe
Aug 18, 2021
Author

Hi again, unfortunately I haven't worked on this, as I was immersed in deadlines and other tasks. However, I came across a number of works on optimal transport (a subject I didn't know anything about… as usual), and I believe my first question was essentially an OT problem. I was looking at the Python OT library, and in particular at the applications of Fused Gromov-Wasserstein, described in this paper. Do you think it could be a good idea to use FGW (or OT in general) on the graphs underlying multiple UMAP models to initially match the data? Also, could barycenters be used for the same purpose as AlignedUMAP, that is, to obtain a joint embedding over multiple sources?

1 reply

lmcinnes Oct 25, 2021
Maintainer

Interestingly enough I have been working on OT related stuff myself recently for an entirely different project. I haven't had time to look at the cited paper in any depth, but it does look related to some others that I read recently which at least covered some Gromov-Wasserstein distance related ideas. I think there is definitely merit in these sorts of approaches. I think it may require significant thought and work to manage to get something sensible that will apply well to the sorts of problems we are discussing here. That is to say, I think there is a lot of value in looking into this, but I think it is also a non-trivial research project. A lot will depend on how much time you are willing/wanting to invest. Given my own OT interests I suspect I will eventually cycle around to touching on such things, but I may not do it anytime soon, nor necessarily end up addressing the problems that are most of interest to you here.
