
Add covariates-only model for comparison in the main notebook #93

Merged · 7 commits · Jun 6, 2017

Conversation

@patrick-miller (Member) commented May 20, 2017

Following from the great work done by @joshlevy89 in #67, I have created a notebook that takes the covariates and runs three models in the same vein as the main directory notebooks:

  1. Full features -- gene expression matrix (after PCA) + covariates
  2. Covariates -- only the covariates
  3. Expressions -- only the gene expression matrix (after PCA)

This notebook strays a bit in that it uses PCA instead of SelectKBest, but I think that is the direction we are moving anyway. Each of the models is fit on the same partitions, and metrics are calculated for each. In most places I store the results for every model, but in a few spots where we show results, I only access one model. I can change it either way; let me know what you prefer, @dhimmel.

One little "cheat" that we are performing here is running PCA on the entire expression matrix instead of just on the train partition. In a separate pull request, I think we should migrate the PCA to being performed inside the pipeline. However, that is not straightforward.
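A minimal sketch of the proposed fix (an assumption, not code from this PR): putting PCA inside a scikit-learn Pipeline so the components are fit only on the training partition, avoiding the leakage described above. The data and sizes here are illustrative stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.randn(200, 50)            # stand-in for the expression matrix
y = rng.randint(0, 2, size=200)   # stand-in mutation status
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('pca', PCA(n_components=10)),      # fit on the train partition only
    ('classify', LogisticRegression()),
])
pipeline.fit(X_train, y_train)          # PCA never sees the test partition
accuracy = pipeline.score(X_test, y_test)
```

Because the PCA step lives inside the pipeline, cross-validation and grid search would refit the components on each training fold automatically.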

@dhimmel (Member) commented May 20, 2017

One little "cheat" that we are performing here is running PCA on the entire expression matrix instead of just on the train partition. In a separate pull request, I think we should migrate the PCA to being performed inside the pipeline. However, that is not straightforward.

Will do a full review later. Just wanted to note dask-searchcv, which should make running PCA in a GridSearchCV pipeline more efficient: http://jcrist.github.io/introducing-dask-searchcv.html

@dhimmel (Member) left a comment

Really great work. It's nice to see all the ROC curves on one plot.

In the next commit, can you restart and run all on the notebook so the cell IDs are a sequence from 0?


# ## Median absolute deviation feature selection

# In[191]:

Get rid of legacy cells.


# In[198]:

# Plot ROC

Coloring by model and linetype by test/train would be awesome here. Not sure if this is easy to add.

We may also want to consider replacing this plot with your vega-lite viz, or maybe even bokeh. Lots of options; it's not straightforward which is best.

# In[187]:

# Pre-process expression data for use later
n_components = 65

We probably want to go with more components for TP53, since there are so many positives.

@patrick-miller (Author)

Ok, so the two big things left here are:

  • updating the plot (I think just using the Vega spec makes sense, I will write a wrapper to dump the metrics to JSON)
  • building the PCA into the CV pipeline: going to try this
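For the first bullet, a minimal sketch of the kind of JSON wrapper mentioned (function and field names are hypothetical): flattening per-model ROC points into records that a Vega spec can consume.

```python
import json

def roc_to_records(model_name, fpr, tpr):
    """Flatten parallel FPR/TPR arrays into Vega-friendly records."""
    return [{'model': model_name, 'fpr': f, 'tpr': t}
            for f, t in zip(fpr, tpr)]

records = roc_to_records('covariates', [0.0, 0.5, 1.0], [0.0, 0.8, 1.0])
roc_json = json.dumps(records)  # ready to embed as a Vega dataset's "values"
```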

@dhimmel (Member) commented May 25, 2017

updating the plot (I think just using the Vega spec makes sense, I will write a wrapper to dump the metrics to JSON)

IIRC, you should be able to pass the dataframe directly using ipyvega:

@rdvelazquez (Member)

This looks great. It will be good to have a more up-to-date notebook in the main directory, especially for new people.

Are we concerned about including the total number of mutations as a feature? Seems like we are using the thing we are trying to predict.

I removed the selected_cols.append('n_mutations_log1p') line and ran the notebook with a few different random states to see how much the total mutations feature adds... the testing AUROC seems to drop about 0.5%-1% for the full dataset and about 2% for the covariates only.

@patrick-miller (Author)

I had the same question about including number of mutations at the last meetup. @dhimmel has a good explanation for why it is kept.

@dhimmel (Member) commented May 25, 2017

Are we concerned about including the total number of mutations as a feature? Seems like we are using the thing we are trying to predict.

Only in a very minor way. There are ~20,000 genes where mutation is possible, and we are fitting a classifier on only a single one of those genes. So if you wanted to entirely remove any confounding effect, you could subtract a single mutation from n_mutation for samples that are positives. However, I suspect the effect would be trivial. See #8 for why including the mutation load covariate is important.
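The adjustment described above is a one-liner; an illustrative sketch (variable names are hypothetical, not from the notebook):

```python
# Subtract the target gene's own mutation from each positive sample's total
# mutation count, so the covariate cannot trivially encode the label.
n_mutations = [12, 7, 30, 5]   # total mutations per sample
is_mutated = [1, 0, 1, 0]      # status for the single gene being predicted
adjusted = [n - y for n, y in zip(n_mutations, is_mutated)]
# adjusted == [11, 7, 29, 5]
```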

@rdvelazquez (Member)

I agree that removing only the gene you are trying to predict from n_mutation would have a trivial impact.

I was more getting at the fact that you would only be able to use the trained classifier on data sets where you already know the total mutation load. I assume if you know the total mutation load you would likely already know if the gene of interest was mutated (I don't know if this is true).

#8 was a little over my head but it sounds like including the total mutation load is beneficial and not uncommon, it just may limit the use of the produced classifier (if my assumption isn't off-base).

Thanks for the response!

@dhimmel (Member) commented May 26, 2017

I assume if you know the total mutation load you would likely already know if the gene of interest was mutated (I don't know if this is true).

In some cases this will be true. But in these cases, we'd have the user rerun the notebook and remove the mutation load covariate.

@patrick-miller (Author) commented May 26, 2017

I changed the ROC plot to use a version of the Vega spec. We can play around with the Vega configurations to make the plot look a little better (color scheme, size, etc.).

[image: ROC plot]

Maybe we should separate out implementing new CV pipelines into a different PR #96?

@dhimmel (Member) left a comment

Looks good. Would be great to remove dependencies on temp files if possible.

with open('jupyter_data/roc_vega_spec.json', 'r') as fp:
    vega_spec = json.load(fp)

vega.Vega(vega_spec)
(Member)

I was hoping you could directly pass the dataframe to vega.Vega via the data argument. Looking at the source code, I'm not sure whether data does anything, but worth a try. Did you try?

I see above "TODO: do not save intermediate files?"

(Author)

Let's discuss tonight. The data argument only takes in one dataframe and the way I have set up the Vega spec takes 2 dataframes (full FPR + TPR and AUROC summary). In order to use the data argument directly, I think we would need to calculate the AUROC in Vega. I'm not sure that is sufficient.
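One possible direction (an assumption, not what the PR does): since the spec takes two datasets, both could be inlined as "values" on the spec's data entries before rendering, which would remove the temp file. Dataset names and records here are hypothetical.

```python
# Minimal Vega-style spec skeleton with two named datasets
vega_spec = {'data': [{'name': 'roc_points'}, {'name': 'auroc_summary'}]}

inline = {
    'roc_points': [{'fpr': 0.0, 'tpr': 0.0}, {'fpr': 1.0, 'tpr': 1.0}],
    'auroc_summary': [{'model': 'covariates', 'auroc': 0.7}],
}

# Attach each dataset's records as inline "values": no JSON file on disk
for entry in vega_spec['data']:
    entry['values'] = inline[entry['name']]
```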

(Member)

Sounds good. Want to give a small overview on your progress with this PR thus far at the start of the meetup?

(Author)

Sure, the MVP of cognoma is going to be providing a Jupyter notebook, correct?

(Member)

As discussed in cognoma/cognoma#63 yeah. We may still also provide a webapp view, but I think the notebook should be the MVP priority for viewing results.

@dhimmel (Member) commented May 30, 2017

Maybe we should separate out implementing new CV pipelines into a different PR

Agreed!

@patrick-miller (Author)

Ok, let's split out the remaining parts into two new pull requests:

  • fixing visualization (keep interactivity, get rid of temporary files)
  • incorporating PCA on just the gene expressions into the pipeline

@dhimmel (Member) commented Jun 6, 2017

@patrick-miller are you waiting on me / is this ready to merge on your end?

@patrick-miller (Author)

The covariates model part is good to go, unless you have any other issues. What needs to be updated are the visualization and PCA in the pipeline, both of which are a bit tangential.

@rdvelazquez rdvelazquez mentioned this pull request Jun 6, 2017
@dhimmel dhimmel merged commit 2b07eed into cognoma:master Jun 6, 2017
@patrick-miller patrick-miller deleted the add-covariates-model branch June 6, 2017 15:40