05_ aneuploidy inference using copykat and infercnv for a subselection of Wilms tumor from SCPCP000006 #790

Open
maud-p opened this issue Oct 7, 2024 · 3 comments
@maud-p
Contributor

maud-p commented Oct 7, 2024

If you are filing this issue based on a specific GitHub Discussion, please link to the relevant Discussion.

This follows the discussion in PR #776.

Describe the goals of the changes to the analysis module.

On a subset of samples, I want to infer aneuploidy and/or CNV to help identify normal and cancer cells.

For this, I compare copykat and infercnv. For each method, I wanted to compare a few parameters.

  1. copykat

In copykat, a few parameters can be used to fine-tune the results. In particular, we can run copykat with or without a set of normal cells. It is important to note that CopyKAT had difficulty in predicting tumor and normal cells in the cases of pediatric and liquid tumors that have a few CNAs. CopyKAT provides two ways to bypass this to give certain output instead of being dead straight: 1) input a vector of cell names of known normal cells from the same dataset 2) or try to search for T cells. (see copykat). I thus tested both with and without a reference, but I am fairly convinced that providing a few normal cells helps the function.

One parameter I also wanted to test is the clustering distance. In copykat, parameters for clustering include "euclidean" distance and correlational distance, i.e. 1 - "pearson" and 1 - "spearman" similarity. In general, correlational distances tend to favor noisy data, while euclidean distance tends to favor data with larger CN segments. I thus tested euclidean and spearman. In our dataset, I think euclidean (the default) is performing best.
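
To make the difference between the two families of distances concrete, here is a small, generic illustration in Python. It is not copykat's internal code (copykat is an R package), and the toy profiles and helper names are made up for this example. It compares euclidean distance with 1 - Pearson and 1 - Spearman on two hypothetical copy-number profiles that share the same shape but differ in amplitude.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-segment copy-number signals for two cells.
cell_a = [0.0, 0.1, 0.5, 0.6, -0.4, -0.5]
cell_b = [2 * v for v in cell_a]  # same shape, doubled amplitude

print("euclidean   :", euclidean(cell_a, cell_b))
print("1 - pearson :", 1 - pearson(cell_a, cell_b))
print("1 - spearman:", 1 - spearman(cell_a, cell_b))
```

In this toy case both correlational distances are zero up to floating-point error, while the euclidean distance is about 1.01: correlational distances compare the shape of the profiles, whereas euclidean distance also reflects the size of the shifts.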

  2. infercnv

Another way of inferring copy number alterations from tumor single-cell RNA-seq data is infercnv. In a previous discussion, we were not sure about the impact of how the healthy reference is defined. For that reason, I ran infercnv with no normal cells as reference, or with immune and/or endothelial cells as reference.
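
As a rough overview of the conditions described above, the combinations can be enumerated as follows. This is only a sketch in Python (the module's scripts are in R); the labels are hypothetical, and reading "immune and/or endothelial" as three separate reference sets is an assumption.

```python
from itertools import product

# copykat: two clustering distances, each with and without normal cells.
copykat_distances = ["euclidean", "spearman"]
copykat_references = ["no_reference", "with_reference"]

# infercnv: no reference, or immune and/or endothelial cells as reference
# (assumed here to mean immune only, endothelial only, or both).
infercnv_references = ["none", "immune", "endothelial", "immune_and_endothelial"]

conditions = [
    {"method": "copykat", "distance": distance, "reference": reference}
    for distance, reference in product(copykat_distances, copykat_references)
]
conditions += [
    {"method": "infercnv", "distance": None, "reference": reference}
    for reference in infercnv_references
]

for condition in conditions:
    print(condition)
```

Under these assumptions there are four copykat runs and four infercnv runs per sample.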

What will your pull request contain?

The PR contains:

  • R scripts to run copykat and infercnv on a selection of samples
  • results from 05_copykat.R and 06_infercnv.R, which will be transferred via the S3 bucket.
  • notebook_templates that start looking into copykat and infercnv and compare results
  • notebooks for a selection of samples

Will you require additional software beyond what is already in the analysis module?

no

Will you require different computational resources beyond what the analysis module already uses?

No response

If known, when do you expect to file the pull request?

today or tomorrow

@maud-p maud-p added the analysis label Oct 7, 2024
@sjspielman
Member

Hi @maud-p, thanks for filing this issue with your plans for the next steps! It sounds like you've taken some time to explore how to best perform these steps, which is great. The one thing I want to say for now is, we want to make sure that decisions you've made here are visible in the module itself. Just as one example, you wrote above,

I thus tested euclidean and spearman. In our dataset, I think euclidean (the default) is performing best.

There should be some result in the module that demonstrates this. In other words, we don't want to have only code running euclidean distance without also a notebook or similar that provides evidence that euclidean outperforms spearman, to ultimately bolster the results. In this case, I would expect to see a notebook as part of the PR (though of course this can be a few smaller PRs depending on how you are structuring the code, and I'm happy to chat more about strategies for that!) that demonstrates why you chose euclidean over spearman.

@maud-p
Contributor Author

maud-p commented Oct 7, 2024

Hi @sjspielman, thank you for the clarification.

For copykat, there will be 2 notebooks named like 05_cnv_copykat_{distance}_exploration_{sample_id}.html, one per distance, each comparing the results with and without normal cells as reference.

There will be one notebook per sample tested, 06_cnv_exploration_{sample_id}.html, that just plots the CNV heatmaps of all conditions tested, for copykat and infercnv.

It is a bit long to run and might not be easy to interpret, but that would be the plan ;)
I thought I would put everything together in one notebook to make it easier. Then we can select 1 or 2 method/parameter combinations to investigate in more detail in a next step :)
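
For reference, the notebook names this plan would produce can be listed with a short sketch like the one below (Python, for illustration only). The two file name patterns come from this comment; the sample IDs and helper names are placeholders, not actual identifiers from the project.

```python
def copykat_notebook(distance, sample_id):
    """Name of the copykat exploration notebook for one distance and sample."""
    return f"05_cnv_copykat_{distance}_exploration_{sample_id}.html"


def summary_notebook(sample_id):
    """Name of the per-sample notebook plotting all CNV heatmaps."""
    return f"06_cnv_exploration_{sample_id}.html"


def planned_notebooks(sample_ids, distances=("euclidean", "spearman")):
    """List every notebook expected for the given samples."""
    names = []
    for sample_id in sample_ids:
        names.extend(copykat_notebook(d, sample_id) for d in distances)
        names.append(summary_notebook(sample_id))
    return names


# Placeholder sample IDs, not real sample identifiers.
for name in planned_notebooks(["sample_A", "sample_B"]):
    print(name)
```

With two distances, each sample yields three notebooks: two for copykat and one summary.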

@sjspielman
Member

Sounds like a good plan for notebooks! To help support a faster review, it would be best to split this up into a couple PRs (it seems counterintuitive that more PRs will go faster, but it will help in the end!):

  1. A PR that contains code to just run inferCNV and/or copyKAT (assuming this is in a separate script and/or function)
  2. The 05 notebook you describe above
  3. The 06 notebook you describe above

Let me know if this makes sense for how you've written your code or if you had another idea for how to split this up into smaller PRs.

@maud-p maud-p changed the title from "05_ aneuploidy inference using copykat for a subselection of Wilms tumor from SCPCP000006" to "05_ aneuploidy inference using copykat and infercnv for a subselection of Wilms tumor from SCPCP000006" on Oct 8, 2024