Batch-wise computation of gene imputation. #569

MUCDK · 2023-06-26T08:16:10Z

All downstream methods should be linear in memory, i.e. allow for batch-wise computation.

What are the memory requirements for the mp.impute() and mp.correlate() functions? For mp.impute(), if I try to impute all the HVGs in my scRNA-seq dataset (~3k genes), I get XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory allocating XXXX bytes. If I specify a subset of ~30 genes to impute, it works perfectly fine, but I was wondering what the upper limit on the number of genes you can impute is. The same memory error is also thrown when running mp.correlate(). I am using a VM with 300GB of RAM.

It should work for more than 30 genes, although to be fair we don't have a batch-wise implementation like the one in cell_transition. A couple of questions on this:

Are you running it on GPU? If yes, try passing device="cpu" to the function call.
How many cells are in the source and target distributions?
Are there batches in the spatial data? That is, do you explicitly pass batch_key in prepare?

An easy solution would be to loop over chunks of genes and concatenate the resulting AnnData objects, such as:

import anndata as ad
import numpy as np

adatas_l = []
# split all genes into 100 lists of ~30 genes each
for genes in np.array_split(adata_sc.var_names, 100):
    adatas_l.append(mp.impute(var_names=genes))
adata_imputed = ad.concat(adatas_l, axis=1)
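
The loop above fixes the number of chunks (100). Below is a minimal sketch of the same idea that fixes the maximum number of genes per call instead, since that is the quantity that drives memory use. `chunked` and `impute_in_batches` are hypothetical helper names, not part of moscot, and `impute_fn` stands in for something like `lambda genes: mp.impute(var_names=genes)`. The toy stand-in below just returns a list per chunk so the sketch runs without moscot.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(names: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` names, preserving order."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(names), size):
        yield list(names[start:start + size])


def impute_in_batches(
    names: Sequence[str],
    impute_fn: Callable[[List[str]], list],
    size: int = 30,
) -> list:
    """Call `impute_fn` once per chunk and collect the per-chunk results."""
    return [impute_fn(genes) for genes in chunked(names, size)]


# Toy stand-in for the real imputation call (assumption: the real call would
# return one AnnData per chunk, to be joined with ad.concat(..., axis=1)).
genes = [f"gene{i}" for i in range(100)]
results = impute_in_batches(
    genes, impute_fn=lambda g: [n.upper() for n in g], size=30
)
chunk_sizes = [len(r) for r in results]
```

With the real objects, the per-chunk results would presumably be passed to `ad.concat(..., axis=1)` as in the loop above. A chunk size that fits in memory has to be found by trial, starting from the ~30 genes reported to work.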

Are you planning to implement a function that also transfers the cell type labels from the dissociated data to the spatial data? If not, do you recommend a specific way of approaching the problem using the transferred gene expression information?

You could use the cell_transition method or, alternatively, this:

import numpy as np
import pandas as pd

# one-hot encode the annotation of the dissociated cells
dummy = pd.get_dummies(adata_sc.obs["annotation"])
out = mp[("src", "tgt")].pull(dummy, scale_by_marginals=True)
clusters = pd.Categorical([dummy.columns[i] for i in np.array(out.argmax(1))])
adata_spatial.obs["annotation_mapped"] = clusters

The difference between the two methods is how the cluster assignment for a spatial cell is selected: the first selects based on the sum of the transportation cost, the second based on the argmax. The former is more conservative and might return fewer clusters than are present in the source. That might be sensible (especially in the non-low-rank case, where you might have explicitly set some tau for the unbalanced case), but it might not be. Interpretation would be required.
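
To make the argmax variant concrete, here is a toy sketch in plain Python. It assumes the coupling is stored with spatial cells as rows and dissociated cells as columns, and that pulling one-hot labels amounts to summing, for each spatial cell, the transport mass it shares with the cells of each label. Both are assumptions for illustration, not a description of moscot's implementation, and `transfer_labels` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def transfer_labels(
    coupling: Sequence[Sequence[float]], labels: Sequence[str]
) -> List[str]:
    """For each row (spatial cell), return the label with the largest total mass."""
    mapped = []
    for row in coupling:
        if len(row) != len(labels):
            raise ValueError("each row needs one entry per labelled cell")
        mass: Dict[str, float] = defaultdict(float)
        # accumulate the transport mass per label for this spatial cell
        for weight, label in zip(row, labels):
            mass[label] += weight
        mapped.append(max(mass, key=mass.get))
    return mapped


# 3 spatial cells (rows) x 4 dissociated cells (columns); values are made up.
coupling = [
    [0.20, 0.05, 0.00, 0.00],
    [0.00, 0.10, 0.10, 0.05],
    [0.05, 0.00, 0.15, 0.05],
]
labels = ["T cell", "T cell", "B cell", "B cell"]
mapped = transfer_labels(coupling, labels)
```

Because every spatial cell receives exactly one label here, even a cell whose mass is split almost evenly between two labels gets a hard assignment, which is why some interpretation of the result is still needed.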

Originally posted by @giovp in #559 (comment)

MUCDK assigned ArinaDanilina Jun 26, 2023

MUCDK added the enhancement New feature or request label Jun 26, 2023

ArinaDanilina mentioned this issue Jul 19, 2023

adding _annotation_mapping in AnalysisMixin #585

Merged

ArinaDanilina linked a pull request Jul 26, 2023 that will close this issue

adding _annotation_mapping in AnalysisMixin #585

Merged

Marius1311 mentioned this issue Aug 15, 2023

Pull/push in a batch-wise fashion #592

Open

ArinaDanilina closed this as completed in #585 Jan 19, 2024

ArinaDanilina reopened this Jan 19, 2024
