Visium heart dataset #67

Open
wxicu opened this issue Nov 7, 2024 · 5 comments

@wxicu

wxicu commented Nov 7, 2024

Hi, thanks for implementing this fantastic tool!

I'm wondering if the Visium heart dataset might be corrupted because the samples GT_IZ_P9 and GT_IZ_P9_rep2 (ACH0012 and ACH0013) appear to be identical. In the meantime, I'm attempting to download the raw data for GT_IZ_P9 from Zenodo (https://zenodo.org/records/6580069) to replace the problematic part in adata. However, I noticed there are additional columns in adata.obs. Would it be possible to share the data preprocessing notebook on GitHub? Thanks a lot!

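import numpy as np
import graphcompass as gc

adata = gc.datasets.visium_heart()
# Returns True when both samples hold exactly the same expression matrix
np.array_equal(
    adata[adata.obs['sample'] == 'GT_IZ_P9'].X.toarray(),
    adata[adata.obs['sample'] == 'GT_IZ_P9_rep2'].X.toarray()
)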
@merelkuijs
Member

@mayarali, I see that we have uploaded our MIBI-TOF pre-processing notebook (under notebooks/processing), but not our Visium heart one. I checked our hackathon repository and saw that Giovanni concatenated the samples, then saved the concatenated object as adata_processed.h5ad. I'll DM you the path of Giovanni's notebook and the location of the sample objects.

I think it would be good to check if Giovanni's data folder contains any duplicates, but I don't have access to the Helmholtz cluster.
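A check like that could be sketched as follows. This is only an illustration, not the code used for the dataset: `find_duplicate_samples` is a hypothetical helper, and it assumes the expression matrix fits in memory as a dense array and that duplicated samples have their spots in the same order.

```python
import hashlib
from collections import defaultdict

import numpy as np


def find_duplicate_samples(X, sample_labels):
    """Group sample labels whose expression blocks are byte-for-byte identical.

    X is a dense (spots x genes) array and sample_labels holds one label per row.
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X)
    sample_labels = np.asarray(sample_labels)
    groups = defaultdict(list)
    # dict.fromkeys keeps the first-seen order of the unique labels
    for sample in dict.fromkeys(sample_labels.tolist()):
        block = np.ascontiguousarray(X[sample_labels == sample])
        # Include the shape so blocks of different sizes can never collide
        key = (block.shape, hashlib.sha256(block.tobytes()).hexdigest())
        groups[key].append(sample)
    # Keep only the groups that contain more than one sample
    return [names for names in groups.values() if len(names) > 1]
```

With the object from this thread it could be called as `find_duplicate_samples(adata.X.toarray(), adata.obs['sample'])`. Any list it returns names samples that share the same matrix, which would cover all sample pairs at once rather than only GT_IZ_P9 and GT_IZ_P9_rep2.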

@merelkuijs
Member

Thanks for bringing this to our attention, Xichen. I can confirm that the samples are identical, but they aren't supposed to be.

The extra columns you see are probably the columns added after deconvolution. A colleague of ours deconvolved the data using cell2location, but she has since left the lab, and I'm not sure where her script is stored. I've asked her but it might be some days before she replies.

Since the names of the faulty samples are pretty similar, I think something might have gone wrong while saving the deconvolved data. We can check and correct this using our colleague's script.

@merelkuijs
Member

merelkuijs commented Nov 25, 2024

Hi Xichen, thanks for your patience!

Our colleague got back to us. According to her, the deconvolution was performed by the original authors, who made the results available at https://cellxgene.cziscience.com/collections/8191c283-0816-424b-9b61-c3e1d6258a77. The authors uploaded the data for each sample separately. I checked GT_IZ_P9 and GT_IZ_P9_rep2 there and they are different, so it seems the duplication happened when my colleague concatenated the data.

I will try to correct our data soon. Stay tuned!
Merel

@wxicu
Author

wxicu commented Nov 25, 2024

Hi, thank you for taking care of this. I am working on the multicell project mentored by @bio-la, so I have already tried to fix the data myself and am happy to share my script in case it helps. It might look complicated because I need the raw counts to rerun cell2location; you can skip that part. I also noticed that your colleague annotated the genes. The processed data shared by the original authors only provides gene names, so I also tried to map them back to gene IDs and annotate them in the same way.

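import graphcompass as gc
import scanpy as sc

adata = gc.datasets.visium_heart()
# Drop the sample to be replaced; .copy() turns the view into a real AnnData before it is modified
adata = adata[adata.obs['sample'] != 'GT_IZ_P9'].copy()
adata.layers['normalized'] = adata.X.copy()
adata.X = adata.raw.X
var_all = adata.var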

adata_p9 = sc.read(f"{DATA_PATH}/Visium_GT_IZ_P9.h5ad")
adata_p9.layers['normalized'] = adata_p9.X.copy()
adata_p9.X = adata_p9.raw.X

# Set sample-level metadata for GT_IZ_P9 (values are hard-coded)
adata_p9.obs['tissue'] = "heart left ventricle"
adata_p9.obs['sample'] = "GT_IZ_P9"
adata_p9.obs['disease'] = "myocardial infarction"
adata_p9.obs['organism'] = "Homo sapiens"
adata_p9.obs['assay'] = "Visium Spatial Gene Expression"
adata_p9.obs['ethnicity'] = "European"
adata_p9.obs['condition'] = "GT_IZ"
adata_p9.obs['sex'] = 'male'
adata_p9.obs['development_stage'] = "52-year-old human stage"
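# Reuse the cell_type_original -> cell_type mapping from the remaining samples
cell_type_map = (
    adata.obs[['cell_type', 'cell_type_original']]
    .set_index('cell_type_original')['cell_type']
    .to_dict()
)
adata_p9.obs['cell_type'] = adata_p9.obs['cell_type_original'].map(cell_type_map)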

# Fetch feature names and ids
diff_var_name = {'TBCE.1':'TBCE-1',
 'LINC01238.1': 'LINC01238-1',
 'CYB561D2.1': 'CYB561D2-1',
 'MATR3.1': 'MATR3-1',
 'HSPA14.1': 'HSPA14-1',
 'TMSB15B.1': 'TMSB15B-1'
 }
adata_p9_raw = sc.read_10x_mtx(f"{DATA_PATH}/ACH0012/outs/Volumes/RicoData2/MI_project/MI_revisions/HCA_submission/spatial/ACH0012/outs/filtered_feature_bc_matrix")
gene_map = adata_p9_raw.var['gene_ids'].to_dict()
adata_p9.var['feature_id'] = adata_p9.var['features'].apply(lambda x: diff_var_name.get(x, x)).map(gene_map)
adata_p9.var = adata_p9.var.set_index('feature_id', drop=True)

adata = sc.concat([adata, adata_p9], axis=0)
adata.var = adata.var.merge(var_all, left_index=True, right_index=True, how='left')
adata.var = adata.var.merge(adata_p9.var, left_index=True, right_on='feature_id')
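
The name-to-ID step above leaves a missing value for any gene name that is absent from `gene_map`, and those then end up in the index. A small pre-check could surface such names first. The following is a minimal sketch in plain Python, not part of the original script: `map_names_to_ids` is a hypothetical helper and the IDs in the usage lines are made-up placeholders, not real gene IDs.

```python
def map_names_to_ids(feature_names, name_to_id, renames=None):
    """Map gene names to IDs and report the names that could not be mapped.

    renames plays the role of diff_var_name above: it rewrites a name
    (for example 'TBCE.1' -> 'TBCE-1') before the lookup in name_to_id.
    """
    renames = renames or {}
    mapped, unmapped = {}, []
    for name in feature_names:
        key = renames.get(name, name)  # apply the rename only if one exists
        gene_id = name_to_id.get(key)
        if gene_id is None:
            unmapped.append(name)      # would become a missing value in the merge
        else:
            mapped[name] = gene_id
    return mapped, unmapped


# Usage with placeholder IDs (not real identifiers)
mapped, unmapped = map_names_to_ids(
    ["TBCE.1", "MATR3", "NOT_A_GENE"],
    {"TBCE-1": "ID_PLACEHOLDER_1", "MATR3": "ID_PLACEHOLDER_2"},
    renames={"TBCE.1": "TBCE-1"},
)
print(unmapped)
```

In the script above, `feature_names` would correspond to `adata_p9.var['features']` and `name_to_id` to `gene_map`. An empty `unmapped` list suggests that `diff_var_name` covers every naming difference.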

@wxicu
Author

wxicu commented Nov 25, 2024

The h5ad file was downloaded from: https://zenodo.org/records/6578047
