Evaluate performance of covariates at predicting various mutations #47
Creates an explore directory and README for this type of exploratory notebook. See how well covariates (non-expression features) predict TP53 mutation. Related to cognoma#8: General mutation-load does provide some ability to predict mutation status of TP53. Partially addresses cognoma#21: Covariates are extracted from samples.tsv.
This isn't too surprising given TP53's role in controlling cell cycle checkpoints. Is this only true for genes in cell cycle checkpoint or DNA repair pathways, or is it also true for genes in proliferation pathways?
Evaluate covariate-only classifiers for the interesting mutations compiled in cognoma/cancer-data#22 (comment). Switches to an expand grid system for evaluating all possible covariate combinations. Plot performance of all covariates on each mutation. Switches to `covariates.tsv` created in cognoma/cancer-data#24 for encoded covariates.
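For readers unfamiliar with the expand grid idea mentioned above, here is a minimal standard-library sketch of enumerating every covariate combination. The `expand_grid` helper and the covariate group names (`mutation_load`, `disease`, `organ`) are illustrative assumptions, not the notebook's actual code or the real columns of `covariates.tsv`.

```python
import itertools


def expand_grid(options):
    """Return one dict per combination of the values in `options`."""
    keys = list(options)
    return [dict(zip(keys, values))
            for values in itertools.product(*options.values())]


# Hypothetical covariate groups; the real names come from covariates.tsv.
covariate_groups = ['mutation_load', 'disease', 'organ']

# Each covariate group is either excluded (False) or included (True).
grid = expand_grid({name: [False, True] for name in covariate_groups})

# Drop the empty combination: a classifier needs at least one feature.
grid = [row for row in grid if any(row.values())]

for row in grid:
    selected = [name for name, include in row.items() if include]
    print(selected)
```

Each row of the grid would then select the feature columns for one covariate-only classifier, so three groups yield seven models per mutation.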
@cgreene, I updated the notebook to evaluate performance of covariate-only classifiers for the 8 interesting mutations we've previously considered. Here is performance of the models with all covariates included: So actually, TP53 is among the hardest to predict using only covariates. VHL, which is highly disease-specific, achieves a near-perfect AUROC. Therefore, without the disease or organ covariate, expression classifiers of VHL are likely just classifying kidney clear cell carcinoma / kidney tissue. See this dataframe to get a general idea of covariate importance. Mutation load and disease type both seem important.
@dhimmel: Interesting! I can imagine that gene expression would clearly capture disease/organ, so if a mutation usually occurred in a single gene that was relatively disease-specific (e.g. VHL), I could imagine that creating a strong signal.
@@ -0,0 +1,280 @@
looks like you need to nbconvert this again
I'll take a look again once it's easier to read
Good catch. Should be fixed now.
looks like nbconvert confounding.ipynb has to be rerun
Interesting and informative analysis. My comments are generally cosmetic, but it would be nice to get some additional documentation on some of the functions!
@@ -0,0 +1,9 @@
# A directory for exploratory machine learning analyses

This directory is home is exploratory analyses that help answer questions about how we should do machine learning. For algorithm implementations see the [`algorithms`](../algorithms) directory. For other types of analyses, place them here.
"This directory is home to exploratory analyses"?
# coding: utf-8

# # Create a logistic regression model to several mutations from covariates
"Create a logistic regression model to predict several mutations from covariates"?
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())

mutations = {
    '7157': 'TP53', # tumor protein p53
pep8 says inline comments should be separated by two spaces
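As a small illustration of this point, PEP 8 asks for at least two spaces between a statement and its inline comment. Applied to the flagged line, the entry would look roughly like this (only the one dictionary entry visible in the diff is shown, and the dictionary is closed here purely so the snippet runs):

```python
mutations = {
    '7157': 'TP53',  # tumor protein p53
}
```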
# In[8]:

def get_aurocs(X, y, series):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
what does `series` look like?