
Add covariates-only model for comparison in the main notebook #93

Merged · 7 commits · Jun 6, 2017

Conversation

@patrick-miller (Member) commented May 20, 2017

Following from the great work done by @joshlevy89 in #67, I have created a notebook that takes the covariates and runs three models in the same vein as the main directory notebooks:

  1. Full features -- gene expression matrix (after PCA) + covariates
  2. Covariates -- only the covariates
  3. Expressions -- only the gene expression matrix (after PCA)

This notebook strays a bit in that it uses PCA instead of SelectKBest, but I think that is the direction we are moving anyway. Each of the models is fit on the same partitions, and metrics are calculated for each. In most places I store the results for every model, but in a few spots where we show results, I only access one model. I can change it either way; let me know what you prefer, @dhimmel.

One little "cheat" that we are performing here is running PCA on the entire expression matrix instead of just on the train partition. In a separate pull request, I think we should migrate the PCA to being performed inside the pipeline. However, that is not straightforward.
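A minimal sketch of the proposed fix (an assumption, not code from this PR): putting PCA inside a scikit-learn Pipeline so the components are fit only on the training partition, avoiding the leakage described above. The data and sizes here are illustrative stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.randn(200, 50)            # stand-in for the expression matrix
y = rng.randint(0, 2, size=200)   # stand-in mutation status
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ('pca', PCA(n_components=10)),      # fit on the train partition only
    ('classify', LogisticRegression()),
])
pipeline.fit(X_train, y_train)          # PCA never sees the test partition
accuracy = pipeline.score(X_test, y_test)
```

Because the PCA step lives inside the pipeline, cross-validation and grid search would refit the components on each training fold automatically.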

@dhimmel (Member) commented May 20, 2017

One little "cheat" that we are performing here is running PCA on the entire expression matrix instead of just on the train partition. In a separate pull request, I think we should migrate the PCA to being performed inside the pipeline. However, that is not straightforward.

Will do a full review later. Just wanted to note dask-searchcv, which should make running PCA in a GridSearchCV pipeline more efficient: http://jcrist.github.io/introducing-dask-searchcv.html

@dhimmel (Member) left a comment

Really great work. It's nice to see all the ROC curves on one plot.

In the next commit, can you restart and run all on the notebook so the cell IDs are a sequence from 0?


# ## Median absolute deviation feature selection

# In[191]:

Get rid of legacy cells.


# In[198]:

# Plot ROC

Coloring by model and linetype by test/train would be awesome here. Not sure if this is easy to add.

We may also want to consider replacing this plot with your vega-lite viz, or maybe even bokeh. Lots of options; it's not straightforward which is best.

# In[187]:

# Pre-process expression data for use later
n_components = 65

We probably want to go with more components for TP53, since there are so many positives.

@patrick-miller (Author)

Ok, so the two big things left here are:

  • updating the plot (I think just using the Vega spec makes sense, I will write a wrapper to dump the metrics to JSON)
  • building the PCA into the CV pipeline: going to try this
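For the first bullet, a minimal sketch of the kind of JSON wrapper mentioned (function and field names are hypothetical): flattening per-model ROC points into records that a Vega spec can consume.

```python
import json

def roc_to_records(model_name, fpr, tpr):
    """Flatten parallel FPR/TPR arrays into Vega-friendly records."""
    return [{'model': model_name, 'fpr': f, 'tpr': t}
            for f, t in zip(fpr, tpr)]

records = roc_to_records('covariates', [0.0, 0.5, 1.0], [0.0, 0.8, 1.0])
roc_json = json.dumps(records)  # ready to embed as a Vega dataset's "values"
```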

@dhimmel (Member) commented May 25, 2017

updating the plot (I think just using the Vega spec makes sense, I will write a wrapper to dump the metrics to JSON)

IIRC, you should be able to pass the dataframe directly using ipyvega:

@rdvelazquez (Member)

This looks great. It will be good to have a more up-to-date notebook in the main directory, especially for new people.

Are we concerned about including the total number of mutations as a feature? Seems like we are using the thing we are trying to predict.

I removed the selected_cols.append('n_mutations_log1p') line and ran the notebook with a few different random states to see how much the total mutations feature adds... the testing AUROC seems to drop about 0.5%-1% for the full dataset and about 2% for the covariates only.

@patrick-miller (Author)

I had the same question about including number of mutations at the last meetup. @dhimmel has a good explanation for why it is kept.

@dhimmel (Member) commented May 25, 2017

Are we concerned about including the total number of mutations as a feature? Seems like we are using the thing we are trying to predict.

Only in a very minor way. There are ~20,000 genes where mutation is possible, and we are fitting a classifier on only a single one of those genes. So if you wanted to entirely remove any confounding effect, you could subtract a single mutation from n_mutation for samples that are positives. However, I suspect the effect would be trivial. See #8 for why including the mutation load covariate is important.
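The adjustment described above is a one-liner; an illustrative sketch (variable names are hypothetical, not from the notebook):

```python
# Subtract the target gene's own mutation from each positive sample's total
# mutation count, so the covariate cannot trivially encode the label.
n_mutations = [12, 7, 30, 5]   # total mutations per sample
is_mutated = [1, 0, 1, 0]      # status for the single gene being predicted
adjusted = [n - y for n, y in zip(n_mutations, is_mutated)]
# adjusted == [11, 7, 29, 5]
```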

@rdvelazquez (Member)

I agree that removing only the gene you are trying to predict from n_mutation would have a trivial impact.

I was more getting at the fact that you would only be able to use the trained classifier on data sets where you already know the total mutation load. I assume if you know the total mutation load you would likely already know if the gene of interest was mutated (I don't know if this is true).

#8 was a little over my head but it sounds like including the total mutation load is beneficial and not uncommon, it just may limit the use of the produced classifier (if my assumption isn't off-base).

Thanks for the response!

@dhimmel (Member) commented May 26, 2017

I assume if you know the total mutation load you would likely already know if the gene of interest was mutated (I don't know if this is true).

In some cases this will be true. But in these cases, we'd have the user rerun the notebook and remove the mutation load covariate.

@patrick-miller (Author) commented May 26, 2017

I changed the ROC plot to use a version of the Vega spec. We can play around with the Vega configurations to make the plot look a little better (color scheme, size, etc.).

[image: ROC plot]

Maybe we should separate out implementing new CV pipelines into a different PR #96?

@dhimmel (Member) left a comment

Looks good. Would be great to remove dependencies on temp files if possible.

with open('jupyter_data/roc_vega_spec.json', 'r') as fp:
    vega_spec = json.load(fp)

vega.Vega(vega_spec)
(Member)

I was hoping you could directly pass the dataframe to vega.Vega via the data argument. Looking at the source code, I'm not sure whether data does anything, but worth a try. Did you try?

I see above "TODO: do not save intermediate files?"

(Author)

Let's discuss tonight. The data argument only takes in one dataframe and the way I have set up the Vega spec takes 2 dataframes (full FPR + TPR and AUROC summary). In order to use the data argument directly, I think we would need to calculate the AUROC in Vega. I'm not sure that is sufficient.
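One possible direction (an assumption, not what the PR does): since the spec takes two datasets, both could be inlined as "values" on the spec's data entries before rendering, which would remove the temp file. Dataset names and records here are hypothetical.

```python
# Minimal Vega-style spec skeleton with two named datasets
vega_spec = {'data': [{'name': 'roc_points'}, {'name': 'auroc_summary'}]}

inline = {
    'roc_points': [{'fpr': 0.0, 'tpr': 0.0}, {'fpr': 1.0, 'tpr': 1.0}],
    'auroc_summary': [{'model': 'covariates', 'auroc': 0.7}],
}

# Attach each dataset's records as inline "values": no JSON file on disk
for entry in vega_spec['data']:
    entry['values'] = inline[entry['name']]
```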

(Member)

Sounds good. Want to give a small overview on your progress with this PR thus far at the start of the meetup?

(Author)

Sure, the MVP of cognoma is going to be providing a Jupyter notebook, correct?

(Member)

As discussed in cognoma/cognoma#63 yeah. We may still also provide a webapp view, but I think the notebook should be the MVP priority for viewing results.

@dhimmel (Member) commented May 30, 2017

Maybe we should separate out implementing new CV pipelines into a different PR

Agreed!

@patrick-miller (Author)

Ok, let's split out the remaining parts into two new pull requests:

  • fixing visualization (keep interactivity, get rid of temporary files)
  • incorporating PCA on just the gene expressions into the pipeline

@dhimmel (Member) commented Jun 6, 2017

@patrick-miller are you waiting on me / is this ready to merge on your end?

@patrick-miller (Author)

The covariates model part is good to go, unless you have any other issues. What needs to be updated are the visualization and PCA in the pipeline, both of which are a bit tangential.

@rdvelazquez rdvelazquez mentioned this pull request Jun 6, 2017
@dhimmel dhimmel merged commit 2b07eed into cognoma:master Jun 6, 2017
@patrick-miller patrick-miller deleted the add-covariates-model branch June 6, 2017 15:40