Evaluate performance of covariates at predicting various mutations #47

dhimmel · 2016-09-15T15:12:33Z

Creates an explore directory and README for this type of exploratory notebook.

See how well covariates (non-expression features) predict TP53 mutation.

Related to #8: General mutation-load does provide some ability to predict mutation status of TP53.

Partially addresses #21: Covariates are extracted from samples.tsv.

cgreene · 2016-09-15T15:45:38Z

> Related to #8: General mutation-load does provide some ability to predict mutation status of TP53.

This isn't too surprising given TP53's role in controlling cell cycle checkpoints. Is this only true for genes in cell cycle checkpoint or DNA repair pathways, or is it also true for genes in proliferation pathways?

Evaluate covariate-only classifiers for the interesting mutations compiled in cognoma/cancer-data#22 (comment). Switches to an expand grid system for evaluating all possible covariate combinations. Plot performance of all covariates on each mutation. Switches to `covariates.tsv` created in cognoma/cancer-data#24 for encoded covariates.
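
The "expand grid" idea mentioned above can be sketched as follows. This is a minimal, dependency-free illustration rather than the notebook's code: the covariate group names are hypothetical, and rows are returned as plain dicts, whereas the diff excerpt quoted in the review suggests the notebook builds a pandas DataFrame.

```python
import itertools


def expand_grid(data_dict):
    """Return one dict per combination of the values listed for each key."""
    keys = list(data_dict)
    return [dict(zip(keys, values))
            for values in itertools.product(*data_dict.values())]


# Hypothetical covariate groups; each can be switched on (1) or off (0).
covariate_groups = ['mutation_load', 'disease', 'organ']
grid = expand_grid({group: [0, 1] for group in covariate_groups})

# Drop the row with no covariates selected, since there is nothing to fit.
grid = [row for row in grid if any(row.values())]
for row in grid:
    print(row)
```

Each remaining row describes one covariate subset to train and score, so the same loop can be repeated per mutation.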

dhimmel · 2016-09-19T16:32:47Z

> Is this only true for genes in cell cycle checkpoint or DNA repair pathways, or is it also true for genes in proliferation pathways?

@cgreene, I updated the notebook to evaluate performance of covariate-only classifiers for the 8 interesting mutations we've previously considered. Here is performance of the models with all covariates included:

So actually, TP53 is among the hardest to predict using only covariates. VHL, which is highly disease-specific, achieves a near-perfect AUROC. Therefore, without the disease or organ covariate, expression classifiers of VHL are likely just classifying kidney clear cell carcinoma / kidney tissue.

See this dataframe to get a general idea of covariate importance. Mutation load and disease type both seem important.
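
The VHL observation can be illustrated with a toy calculation. The sketch below is not from the notebook: the sample labels are invented, and `auroc` is a hand-rolled helper that uses the pairwise definition of AUROC (the chance that a randomly chosen mutated sample gets a higher score than a randomly chosen non-mutated one, with ties counted as half).

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score of a positive > score of a negative), ties = 0.5."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data, NOT from the PR.
# Covariate: 1 = kidney clear cell carcinoma sample, 0 = any other disease.
is_kidney_clear_cell = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Outcome: 1 = VHL mutated. Mutations occur only in the kidney samples.
vhl_mutated = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Use the disease indicator itself as the "prediction score".
vhl_auroc = auroc(is_kidney_clear_cell, vhl_mutated)
print(round(vhl_auroc, 3))
```

With these made-up numbers a single disease indicator already scores about 0.93, which is the sense in which a disease-specific mutation can be "predicted" from covariates alone.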

cgreene · 2016-09-19T17:57:12Z

@dhimmel: Interesting! Gene expression would clearly capture disease/organ, so if a mutation in a single gene is relatively disease-specific (e.g. VHL), I could imagine that creating a strong signal.

gwaybio · 2016-09-19T20:20:11Z

explore/confounding/confounding.py

@@ -0,0 +1,280 @@
+


looks like you need to nbconvert this again

I'll take a look again once it's easier to read

Good catch. Should be fixed now.

gwaybio

looks like nbconvert confounding.ipynb has to be rerun

gwaybio

Interesting and informative analysis. My comments are generally cosmetic, but it would be nice to get some additional documentation on some of the functions!

gwaybio · 2016-09-21T19:48:57Z

explore/README.md

@@ -0,0 +1,9 @@
+# A directory for exploratory machine learning analyses
+
+This directory is home is exploratory analyses that help answer questions about how we should do machine learning. For algorithm implementations see the [`algorithms`](../algorithms) directory. For other types of analyses, place them here.


"This directory is home to exploratory analyses"?

gwaybio · 2016-09-21T19:59:24Z

explore/confounding/confounding.py

+
+# coding: utf-8
+
+# # Create a logistic regression model to several mutations from covariates


"Create a logistic regression model to predict several mutations from covariates"?

gwaybio · 2016-09-21T20:04:43Z

explore/confounding/confounding.py

+    return pd.DataFrame.from_records(rows, columns=data_dict.keys())
+
+mutations = {
+    '7157': 'TP53', # tumor protein p53


pep8 says inline comments should be separated by two spaces
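
For reference, PEP 8 asks for at least two spaces between a statement and its inline comment, followed by `#` and a single space. A sketch of the flagged line in that style, reproducing only the one entry visible in the diff:

```python
# Entrez Gene ID -> gene symbol. Only the entry shown in the diff is reproduced.
mutations = {
    '7157': 'TP53',  # tumor protein p53
}
```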

gwaybio · 2016-09-21T20:07:53Z

explore/confounding/confounding.py

+# In[8]:
+
+def get_aurocs(X, y, series):
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)


what does series look like?

dhimmel mentioned this pull request Sep 15, 2016

First attempt at processing covariate information. #46

Closed

dhimmel assigned dhimmel and gwaybio and unassigned dhimmel Sep 19, 2016

dhimmel changed the title ~~Evaluate performance of covariates on TP53~~ Evaluate performance of covariates at predicting various mutations Sep 19, 2016

gwaybio reviewed Sep 19, 2016

gwaybio requested changes Sep 19, 2016

Export clean notebook to script

7ef10dd

gwaybio requested changes Sep 21, 2016

Address review comments

b7d2fad

gwaybio approved these changes Sep 21, 2016

dhimmel merged commit cbc0604 into cognoma:master Sep 22, 2016

dhimmel deleted the confounding branch September 22, 2016 13:28

dhimmel mentioned this pull request Oct 26, 2016

TP53 mutation prediction from metadata #66

Closed

dhimmel mentioned this pull request Nov 9, 2016

Marginal gain of gene expression data over covariates #67

Merged
