Suggestions on: Keep ensembl gene ID at read-in, automate non-unique gene renaming, add default rownames(), accept .h5 format :) #4

estellad · 2024-08-27T10:52:23Z

Hi Marcel,

Thank you for the package! There has been a long-lasting issue with reading in Visium /outs, where there are non-unique gene names. There are 18085 genes, but 18082 unique gene names. Some of the downstream analyses do require unique gene names.

> tail(sort(table(rowData(vis)$Symbol)))

 ZYG11B     ZYX   ZZEF1  HSPA14    TBCE TMSB15B 
            1           1            1             2            2               2

In Seurat, this was solved by adding a ".1" to one of the duplicates in the following genes "HSPA14", "TBCE", "TMSB15B". They are distinguishable based on their Ensembl IDs, which were preserved during the read-in process.

Previously with SpatialExperimen::read10xVisium(), where the Ensembl ID was preserved, I had a workaround for this issue by mimicking Seurat's approach:

rowData(vis)$gene_id <- rownames(vis)
rownames(vis) <- rowData(vis)$symbol

rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000285053",  "TBCE", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000284770",  "TBCE.1", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000284024",  "HSPA14", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000187522",  "HSPA14.1", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000158427",  "TMSB15B", rownames(vis))
rownames(vis) <- ifelse(rowData(vis)$gene_id == "ENSG00000269226",  "TMSB15B.1", rownames(vis))

With VisiumIO::TENxVisium(), however, I noticed that Ensembl ID is lost or not retrieved from the gene count matrix folder during the read-in.

vis <- TENxVisium(
  spacerangerOut = vis_path, processing = "filtered", images = "lowres"
)
vis <- import(vis)
vis
class: SpatialExperiment 
dim: 18085 4992 
metadata(0):
assays(1): counts
rownames: NULL
rowData names(1): Symbol
colnames(4992): AACACCTACTATCGAA-1 AACACGTGCATCGCAC-1 ...
  TGTTGGCCAGACCTAC-1 TGTTGGCCTACACGTG-1
colData names(4): in_tissue array_row array_col sample_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
imgData names(4): sample_id image_id data scaleFactor

Therefore, I cannot rename them based on Ensembl ID anymore. Could you please automate this renaming process internally, like Seurat, so that users do not have to do this manually every time? Given these 3 genes are not highly expressed, I excluded them in my analysis for now.

In addition, a small inconvenience is the rownames() of the object is left as NULL, but I would prefer the default to have something, either Ensembl ID or Symbol.

I also noticed that VisiumIO::TENxVisium() cannot take a gene count .h5 file, as an alternative to the gene count matrix folder. I find this alternative helpful. I made it flexible in my package SpatialExperimentIO::readXeniumSXE(). Would you consider implementing this alternative again, as SpatialExperimen::read10xVisium() did before?

Also looping in @drighelli and @HelenaLC

Thank you for your time!

Sincerely,
Estella

The text was updated successfully, but these errors were encountered:

drighelli · 2024-08-27T14:57:38Z

My two cents:
it would be ok to append the .1 to the duplicates, but maybe it could be nice to print a warning or just a message alerting the user of this behaviour.

Also, wouldn't it be faster to do something like :

rownames(vis)[duplicated(vis)] <- paste0(rownames(vis)[duplicated(vis)], ".1")
message("Found sum(duplicated(vis)) duplicated rownames, appending .1 to them")

I have another question, are we sure that we can always find only 1 duplicate for each gene? if there are more?

Maybe it could be better to generalize this with an increasing number from 1 to n, where n is the number of duplicates per each duplicated rowname.

About the .h5 matrix, it would be nice to have the possibility to load it instead of the csv matrix.

Dario

HelenaLC · 2024-08-27T16:06:54Z

FYI: base R make.unique() is made to do this...

> make.unique(c("a", "a", "a"))
[1] "a"   "a.1" "a.2"

...and yes, I'd be in favor of 1) keeping both gene symbols and ensembl IDs as rowData(), and 2) taking care of setting & unique-ifying rownames() for the user from the symbols (with a warning when duplicates were found, as Dario suggested).

LiNk-NY · 2024-09-23T20:53:23Z

Hi Estella! @estellad
Can you try this again with TENxIO version 1.7.7 installed? The latest devel version is available on GitHub until it gets built on the BBS.

BiocManager::install("waldronlab/TENxIO")

LiNk-NY · 2024-09-23T20:53:49Z

About the .h5 matrix, it would be nice to have the possibility to load it instead of the csv matrix.

@drighelli
Thanks! This is not supported in VisiumIO

LiNk-NY · 2024-09-23T20:58:37Z

Note. By default, I'm reading in the first column in the features.tsv.gz file which is typically the ID column with ensembl IDs. And HDF5Array::TENxMatrix uses the ensembl IDs as rownames as well.

estellad · 2024-09-24T09:46:51Z

Hi Marcel @LiNk-NY ,

I tried to install TENxIO and then re-install VisiumIO.

BiocManager::install("waldronlab/TENxIO")
BiocManager::install("VisiumIO", force = TRUE)

The imported Visium object now has gene Ensembl ID as the default rownames and gene Symbol stored in rowData(). 👍

Here are still some action items 😄 :

The duplicate gene Symbols have not been fixed with make.unique(), and no warning message on the duplicated gene names are given.
I tried to remove the Matrix folder in /outs, but VisiumIO::TENxVisium() still does not take only .h5 as input. Do you plan on linking the function to TENxIO::TENxH5(). It should be a very simple update but would accept another file format.

Thank you!
Estella

LiNk-NY · 2024-09-24T14:42:25Z

Hi Estella, @estellad

The duplicate gene Symbols have not been fixed with make.unique(), and no warning message on the duplicated gene names are given.

This is actually implemented here https://github.com/waldronlab/TENxIO/blob/7b86cc13a75a3bebc4def176ac5ca334bcadf9a9/R/TENxH5-class.R#L338-L343
but it only happens with rownames. Do you mean going into rowData()$Symbols and running make.unique() on it? It think it would be better not to munge the data unless necessary but I could add a helper function to do that if needed..

I tried to remove the Matrix folder in /outs, but VisiumIO::TENxVisium() still does not take only .h5 as input. Do you plan on linking the function to TENxIO::TENxH5(). It should be a very simple update but would accept another file format.

Do you have a reproducible example? I have tested this with the examples in the package. See :

VisiumIO/R/TENxVisium-class.R

Lines 175 to 180 in 25324bb

    
           #' tvfl <- TENxVisium( 
        
           #'     spacerangerOut = sample_dir, 
        
           #'     format = "h5", 
        
           #'     processing = "raw", 
        
           #'     images = "lowres" 
        
           #' )

Thanks!

-Marcel

estellad mentioned this issue Sep 3, 2024

SpatialExperimentIO Bioconductor/Contributions#3370

Open

10 tasks

LiNk-NY self-assigned this Sep 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggestions on: Keep ensembl gene ID at read-in, automate non-unique gene renaming, add default rownames(), accept .h5 format :) #4

Suggestions on: Keep ensembl gene ID at read-in, automate non-unique gene renaming, add default rownames(), accept .h5 format :) #4

estellad commented Aug 27, 2024

drighelli commented Aug 27, 2024

HelenaLC commented Aug 27, 2024

LiNk-NY commented Sep 23, 2024

LiNk-NY commented Sep 23, 2024

LiNk-NY commented Sep 23, 2024

estellad commented Sep 24, 2024 •

edited

Loading

LiNk-NY commented Sep 24, 2024

Suggestions on: Keep ensembl gene ID at read-in, automate non-unique gene renaming, add default rownames(), accept .h5 format :) #4

Suggestions on: Keep ensembl gene ID at read-in, automate non-unique gene renaming, add default rownames(), accept .h5 format :) #4

Comments

estellad commented Aug 27, 2024

drighelli commented Aug 27, 2024

HelenaLC commented Aug 27, 2024

LiNk-NY commented Sep 23, 2024

LiNk-NY commented Sep 23, 2024

LiNk-NY commented Sep 23, 2024

estellad commented Sep 24, 2024 • edited Loading

LiNk-NY commented Sep 24, 2024

estellad commented Sep 24, 2024 •

edited

Loading