Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestions on: Keep ensembl gene ID at read-in, automate non-unique gene renaming, add default rownames(), accept .h5 format :) #4

Open
estellad opened this issue Aug 27, 2024 · 7 comments
Assignees

Comments

@estellad
Copy link

Hi Marcel,

Thank you for the package! There has been a long-lasting issue with reading in Visium /outs, where there are non-unique gene names. There are 18085 genes, but 18082 unique gene names. Some of the downstream analyses do require unique gene names.

> tail(sort(table(rowData(vis)$Symbol)))

 ZYG11B     ZYX   ZZEF1  HSPA14    TBCE TMSB15B 
            1           1            1             2            2               2 

In Seurat, this was solved by adding a ".1" to one of the duplicates in the following genes "HSPA14", "TBCE", "TMSB15B". They are distinguishable based on their Ensembl IDs, which were preserved during the read-in process.

Previously with SpatialExperimen::read10xVisium(), where the Ensembl ID was preserved, I had a workaround for this issue by mimicking Seurat's approach:

rowData(vis)$gene_id <- rownames(vis)
rownames(vis) <- rowData(vis)$symbol

rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000285053",  "TBCE", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000284770",  "TBCE.1", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000284024",  "HSPA14", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000187522",  "HSPA14.1", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000158427",  "TMSB15B", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000269226",  "TMSB15B.1", rownames(vis))

With VisiumIO::TENxVisium(), however, I noticed that Ensembl ID is lost or not retrieved from the gene count matrix folder during the read-in.

vis <- TENxVisium(
  spacerangerOut = vis_path, processing = "filtered", images = "lowres"
)
vis <- import(vis)
vis
class: SpatialExperiment 
dim: 18085 4992 
metadata(0):
assays(1): counts
rownames: NULL
rowData names(1): Symbol
colnames(4992): AACACCTACTATCGAA-1 AACACGTGCATCGCAC-1 ...
  TGTTGGCCAGACCTAC-1 TGTTGGCCTACACGTG-1
colData names(4): in_tissue array_row array_col sample_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor

Therefore, I cannot rename them based on Ensembl ID anymore. Could you please automate this renaming process internally, like Seurat, so that users do not have to do this manually every time? Given these 3 genes are not highly expressed, I excluded them in my analysis for now.

In addition, a small inconvenience is the rownames() of the object is left as NULL, but I would prefer the default to have something, either Ensembl ID or Symbol.

I also noticed that VisiumIO::TENxVisium() cannot take a gene count .h5 file, as an alternative to the gene count matrix folder. I find this alternative helpful. I made it flexible in my package SpatialExperimentIO::readXeniumSXE(). Would you consider implementing this alternative again, as SpatialExperimen::read10xVisium() did before?

Also looping in @drighelli and @HelenaLC

Thank you for your time!

Sincerely,
Estella

@drighelli
Copy link

My two cents:
it would be ok to append the .1 to the duplicates, but maybe it could be nice to print a warning or just a message alerting the user of this behaviour.

Also, wouldn't it be faster to do something like :

rownames(vis)[duplicated(vis)] <- paste0(rownames(vis)[duplicated(vis)], ".1")
message("Found sum(duplicated(vis)) duplicated rownames, appending .1 to them")

I have another question, are we sure that we can always find only 1 duplicate for each gene? if there are more?

Maybe it could be better to generalize this with an increasing number from 1 to n, where n is the number of duplicates per each duplicated rowname.

About the .h5 matrix, it would be nice to have the possibility to load it instead of the csv matrix.

Dario

@HelenaLC
Copy link

FYI: base R make.unique() is made to do this...

> make.unique(c("a", "a", "a"))
[1] "a"   "a.1" "a.2"

...and yes, I'd be in favor of 1) keeping both gene symbols and ensembl IDs as rowData(), and 2) taking care of setting & unique-ifying rownames() for the user from the symbols (with a warning when duplicates were found, as Dario suggested).

@LiNk-NY
Copy link
Contributor

LiNk-NY commented Sep 23, 2024

Hi Estella! @estellad
Can you try this again with TENxIO version 1.7.7 installed? The latest devel version is available on GitHub until it gets built on the BBS.

BiocManager::install("waldronlab/TENxIO")

@LiNk-NY
Copy link
Contributor

LiNk-NY commented Sep 23, 2024

About the .h5 matrix, it would be nice to have the possibility to load it instead of the csv matrix.

@drighelli
Thanks! This is not supported in VisiumIO

@LiNk-NY
Copy link
Contributor

LiNk-NY commented Sep 23, 2024

Note. By default, I'm reading in the first column in the features.tsv.gz file which is typically the ID column with ensembl IDs. And HDF5Array::TENxMatrix uses the ensembl IDs as rownames as well.

@estellad
Copy link
Author

estellad commented Sep 24, 2024

Hi Marcel @LiNk-NY ,

I tried to install TENxIO and then re-install VisiumIO.

BiocManager::install("waldronlab/TENxIO")
BiocManager::install("VisiumIO", force = TRUE)

The imported Visium object now has gene Ensembl ID as the default rownames and gene Symbol stored in rowData(). 👍

Here are still some action items 😄 :

  • The duplicate gene Symbols have not been fixed with make.unique(), and no warning message on the duplicated gene names are given.

  • I tried to remove the Matrix folder in /outs, but VisiumIO::TENxVisium() still does not take only .h5 as input. Do you plan on linking the function to TENxIO::TENxH5(). It should be a very simple update but would accept another file format.

Thank you!
Estella

@LiNk-NY
Copy link
Contributor

LiNk-NY commented Sep 24, 2024

Hi Estella, @estellad

  • The duplicate gene Symbols have not been fixed with make.unique(), and no warning message on the duplicated gene names are given.

This is actually implemented here https://github.com/waldronlab/TENxIO/blob/7b86cc13a75a3bebc4def176ac5ca334bcadf9a9/R/TENxH5-class.R#L338-L343
but it only happens with rownames. Do you mean going into rowData()$Symbols and running make.unique() on it? It think it would be better not to munge the data unless necessary but I could add a helper function to do that if needed..

  • I tried to remove the Matrix folder in /outs, but VisiumIO::TENxVisium() still does not take only .h5 as input. Do you plan on linking the function to TENxIO::TENxH5(). It should be a very simple update but would accept another file format.

Do you have a reproducible example? I have tested this with the examples in the package. See :

#' tvfl <- TENxVisium(
#' spacerangerOut = sample_dir,
#' format = "h5",
#' processing = "raw",
#' images = "lowres"
#' )

Thanks!

-Marcel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants