Add exploratory analyses of mutation data #22

Merged 2 commits on Sep 7, 2016

Conversation

dhimmel
Member

@dhimmel dhimmel commented Sep 6, 2016

This pull request is based on a preliminary notebook we created at the 2016-08-23 Cognoma Meetup. Tagging @Mike1906 @stephenshank, @drolejoel, @linzho, who were part of this group (we'd love your feedback).

Specifically, I'd like feedback on interesting cancer genes where we expect to see mutation status segregate with disease. For example, the present notebook shows the enrichment of VHL mutations in kidney clear cell carcinoma.

Based on a preliminary notebook we created at the 2016-08-23 Cognoma Meetup
(https://www.meetup.com/DataPhilly/events/233403001/).
@gwaybio
Member

gwaybio commented Sep 7, 2016

BRAF should segregate to melanoma and subsets of lung cancer

BRAFV600E should be a good test for the machine learning group once we get the columns mentioned in #16

Can also visualize BRCA1 and BRCA2 - will largely segregate into breast and subsets of ovarian, cervical, and uterine cancers as well.

@gwaybio
Member

gwaybio commented Sep 7, 2016

Can you also add ALK - should segregate into subsets of lung cancer. ALK is interesting because it is usually inactivated by chromosomal rearrangements, and I suspect a gene expression signature for ALK inactivation could be informative

@gwaybio
Member

gwaybio commented Sep 7, 2016

could possibly incorporate COSMIC here too

@linzho

linzho commented Sep 7, 2016

You can also look at MEN1 and RET, genes which are associated with a lot of neuroendocrine things (pancreas, pituitary, parathyroid, medullary thyroid, pheochromocytoma)

Are you interested in genes associated with cancers in general, or genes where we might expect that the majority of cancers segregate with a single gene?

@dhimmel
Member Author

dhimmel commented Sep 7, 2016

> Are you interested in genes associated with cancers in general, or genes where we might expect that the majority of cancers segregate with a single gene?

@linzho both. Since this is an exploratory analysis, I'm just looking to look!

@dhimmel
Member Author

dhimmel commented Sep 7, 2016

@linzho & @gwaygenomics thanks for your suggestions. I added them to the heatmap in 29c926a, which now looks like this:

[image: heatmap]

I also scaled the mutation rates for each gene by the max mutation rate. Note that there is still the outstanding issue that some diseases harbor more mutations (see row-wise bands above & cognoma/machine-learning#8).
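
A minimal sketch of the scaling described above, assuming the heatmap data is a table of per-disease mutation rates with one entry per gene. The disease names and rate values are invented for illustration, and `scale_by_gene_max` is a hypothetical helper, not code from the notebook.

```python
def scale_by_gene_max(rates):
    """Divide each gene's mutation rates by that gene's maximum rate across diseases.

    `rates` maps disease -> {gene -> mutation rate}. After scaling, the disease
    with the highest rate for a gene gets 1.0 for that gene.
    """
    genes = {gene for per_gene in rates.values() for gene in per_gene}
    max_rate = {
        gene: max(per_gene.get(gene, 0.0) for per_gene in rates.values())
        for gene in genes
    }
    return {
        disease: {
            # Leave genes that are never mutated at 0 rather than dividing by zero.
            gene: (rate / max_rate[gene] if max_rate[gene] > 0 else 0.0)
            for gene, rate in per_gene.items()
        }
        for disease, per_gene in rates.items()
    }


# Invented rates: fraction of samples in each disease with a mutation in the gene.
rates = {
    "disease_a": {"VHL": 0.50, "TP53": 0.02},
    "disease_b": {"VHL": 0.01, "TP53": 0.40},
    "disease_c": {"VHL": 0.00, "TP53": 0.80},
}
scaled = scale_by_gene_max(rates)
```

Scaling per gene puts rarely and frequently mutated genes on one color scale, but it does not remove the row-wise effect of diseases that carry more mutations overall. With a pandas DataFrame that has diseases as rows and genes as columns, the same operation would be `rates / rates.max()`.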

@gwaybio
Copy link
Member

gwaybio commented Sep 7, 2016

would it be useful to add functionality to the script? if the final output is the mutation by tissue heatmap could you add an argparse argument? So the above graph would be generated like:

python scripts/3.explore-mutations.py --gene-list "BRCA2,ALK,CD274,MEN1,VHL,RET,TP53,BRCA1"

just a thought
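
For reference, the suggested flag could be parsed roughly as follows. This is only a sketch of the idea: the `--gene-list` flag comes from the command above, while `parse_gene_list` and `build_parser` are hypothetical names and nothing here exists in the repository.

```python
import argparse


def parse_gene_list(text):
    """Split a comma-separated string such as "BRCA2,ALK" into gene symbols."""
    genes = [gene.strip() for gene in text.split(",") if gene.strip()]
    if not genes:
        raise argparse.ArgumentTypeError("expected at least one gene symbol")
    return genes


def build_parser():
    """Build a parser with a single required --gene-list option."""
    parser = argparse.ArgumentParser(description="Plot a mutation-by-tissue heatmap.")
    parser.add_argument(
        "--gene-list",
        type=parse_gene_list,
        required=True,
        help="comma-separated gene symbols, e.g. BRCA2,ALK,VHL",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--gene-list", "BRCA2,ALK,CD274,MEN1,VHL,RET,TP53,BRCA1"]
)
print(args.gene_list)
```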

@dhimmel
Member Author

dhimmel commented Sep 7, 2016

@gwaygenomics I have a slightly different philosophy here.

scripts/3.explore-mutations.py is an auto-exported script version of the notebook for diff viewing. So all code changes should be done to the notebook. Passing args to the notebook doesn't make sense because you should be able to use notebooks interactively.

So one option is to create a Python module, e.g. heatmap.py, with a function that 3.explore-mutations.ipynb would call and a __main__ block that enables script execution. However, I don't really see a major benefit that justifies the added complexity. If you want to add more genes, you can just open the notebook and add genes to the dictionary.

IMO, notebooks are better than scripts with arguments for agile data science.
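
The module option mentioned above might look something like the sketch below. Everything in it is hypothetical: `heatmap.py` is only the example filename from this comment, `select_genes` stands in for whatever function the notebook would import, and the rates are invented.

```python
"""Hypothetical heatmap.py: importable from a notebook, runnable as a script."""
import sys


def select_genes(mutation_rates, genes):
    """Keep only the requested genes from a {gene: {disease: rate}} mapping.

    Placeholder for the real work (building and drawing the heatmap).
    """
    missing = [gene for gene in genes if gene not in mutation_rates]
    if missing:
        raise ValueError(f"unknown genes: {', '.join(missing)}")
    return {gene: mutation_rates[gene] for gene in genes}


def main(argv):
    """Script entry point: gene symbols are passed as positional arguments."""
    # Invented example data; a real script would load the mutation matrix instead.
    example_rates = {
        "VHL": {"disease_a": 0.5, "disease_b": 0.0},
        "TP53": {"disease_a": 0.1, "disease_b": 0.6},
    }
    try:
        selected = select_genes(example_rates, argv or list(example_rates))
    except ValueError as error:
        print(f"error: {error}", file=sys.stderr)
        return 1
    for gene, by_disease in selected.items():
        print(gene, by_disease)
    return 0


if __name__ == "__main__":
    # Runs only when executed as a script, not on `import heatmap`.
    # A real script would pass the return value to sys.exit().
    main(sys.argv[1:])
```

The notebook would then use `from heatmap import select_genes` interactively, while the command line goes through `main`. Whether that split is worth the extra file is exactly the trade-off discussed in this comment.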

@gwaybio
Member

gwaybio commented Sep 7, 2016

got it - i agree for this script.

Although I do think that moving towards this philosophy will be important when we think about how a user will visualize input genes and input tissues (i.e. the frontend/cancer data discussion yesterday - see cognoma/frontend#12).

LGTM 👍

@dhimmel dhimmel merged commit 67f8032 into cognoma:master Sep 7, 2016
dhimmel added a commit to dhimmel/machine-learning that referenced this pull request Sep 19, 2016
Evaluate covariate-only classifiers for the interesting mutations compiled in
cognoma/cancer-data#22 (comment).

Switches to an expand grid system for evaluating all possible covariate
combinations.

Plot performance of all covariates on each mutation.

Switches to `covariates.tsv` created in
cognoma/cancer-data#24 for encoded covariates.
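
An "expand grid" over covariate combinations, in the sense this commit message describes, can be sketched with the standard library. The covariate group names below are invented for illustration (they are not taken from `covariates.tsv`), and `expand_grid` is a hypothetical helper modeled on R's `expand.grid`, not the function used in the commit.

```python
import itertools


def expand_grid(options):
    """Return one dict per combination of the option values (Cartesian product)."""
    names = list(options)
    return [
        dict(zip(names, values))
        for values in itertools.product(*(options[name] for name in names))
    ]


# Each hypothetical covariate group is either left out (False) or included (True).
covariate_groups = ["disease", "sex", "mutation_load"]
grid = expand_grid({group: [False, True] for group in covariate_groups})

# Drop the empty combination, since a classifier needs at least one covariate.
combinations = [
    [group for group, used in row.items() if used]
    for row in grid
    if any(row.values())
]
print(len(combinations))  # 2**3 - 1 = 7 non-empty combinations
```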
dhimmel added a commit to cognoma/machine-learning that referenced this pull request Sep 22, 2016
* Evaluate performance of covariates on TP53

Creates an explore directory and README for this type of exploratory notebook.

See how well covariates (non-expression features) predict TP53 mutation.

Related to #8:
General mutation-load does provide some ability to predict mutation status of
TP53.

Partially addresses #21:
Covariates are extracted from samples.tsv.

* Evaluate more covariate/mutation combinations

Evaluate covariate-only classifiers for the interesting mutations compiled in
cognoma/cancer-data#22 (comment).

Switches to an expand grid system for evaluating all possible covariate
combinations.

Plot performance of all covariates on each mutation.

Switches to `covariates.tsv` created in
cognoma/cancer-data#24 for encoded covariates.

* Export clean notebook to script

* Address review comments