Add covariates-only model for comparison in the main notebook #93
Conversation
Will do a full review later. Just wanted to note:
Really great work. It's nice to see all the ROC curves on one plot.
Next commit, can you restart and run all on the notebook so the cell IDs are a sequence from 0?
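The notebook code itself isn't reproduced in this thread; the following is only a minimal sketch (synthetic data, hypothetical model names, scikit-learn assumed) of fitting several models on the same partitions and collecting their ROC curves so they can all be drawn on one plot:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the expression + covariate data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical model names; each model is fit on the same train partition
models = {
    'covariates-only': LogisticRegression(max_iter=1000).fit(X_train[:, :5], y_train),
    'expression + covariates': LogisticRegression(max_iter=1000).fit(X_train, y_train),
}

curves = {}
for name, model in models.items():
    X_eval = X_test[:, :5] if name == 'covariates-only' else X_test
    scores = model.predict_proba(X_eval)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    curves[name] = (fpr, tpr, roc_auc_score(y_test, scores))
# Each (fpr, tpr) pair can then be drawn on a single axis, with the AUROC
# in the legend, to get all the curves on one plot
```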
2.TCGA-MLexample-covariates.py (outdated)

    # ## Median absolute deviation feature selection

    # In[191]:
Get rid of legacy cells.
2.TCGA-MLexample-covariates.py (outdated)

    # In[198]:

    # Plot ROC
2.TCGA-MLexample-covariates.py (outdated)

    # In[187]:

    # Pre-process expression data for use later
    n_components = 65
We probably want to go with more components for TP53, since there are so many positives.
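As a rough illustration of the `n_components` trade-off (random data standing in for the TCGA expression matrix, not the notebook's actual preprocessing):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Stand-in for the expression matrix (samples x genes)
expression = rng.normal(size=(200, 1000))

pca = PCA(n_components=65, random_state=0)
components = pca.fit_transform(expression)

# Fraction of variance retained; for a gene like TP53 with many positives,
# raising n_components trades compactness for more retained signal
retained = float(pca.explained_variance_ratio_.sum())
```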
Ok, so the two big things left here are:
IIRC, you should be able to pass the dataframe directly using ipyvega:
This looks great. It will be good to have a more up-to-date notebook in the main directory, especially for new people.

Are we concerned about including the total number of mutations as a feature? It seems like we are using the thing we are trying to predict. I removed the
I had the same question about including the number of mutations at the last meetup. @dhimmel has a good explanation for why it is kept.
Only in a very minor way. There are ~20,000 genes where mutation is possible, and we are fitting a classifier on only a single one of those genes. So if you wanted to entirely remove any confounding effect, you could subtract a single mutation from the total for samples where the target gene is mutated.
I agree that removing only the gene you are trying to predict from the total is a minor point. I was more getting at the fact that you would only be able to use the trained classifier on data sets where you already know the total mutation load. I assume that if you know the total mutation load, you would likely already know whether the gene of interest was mutated (I don't know if this is true). #8 was a little over my head, but it sounds like including the total mutation load is beneficial and not uncommon; it just may limit the use of the produced classifier (if my assumption isn't off-base). Thanks for the response!
In some cases this will be true. But in these cases, we'd have the user rerun the notebook and remove the mutation load covariate. |
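The subtraction described above can be sketched in a few lines of pandas; the column names here are hypothetical and the notebook's actual covariate names may differ:

```python
import pandas as pd

# Hypothetical covariate frame: total mutation count per sample plus a
# binary indicator for the gene being predicted (TP53 as an example)
covariates = pd.DataFrame({
    'sample_id': ['s1', 's2', 's3'],
    'n_mutations': [120, 45, 300],
    'TP53_mutated': [1, 0, 1],
})

# Subtract the target gene's own mutation from the mutation-load covariate,
# so the feature no longer contains the label being predicted
covariates['n_mutations_other'] = (
    covariates['n_mutations'] - covariates['TP53_mutated']
)
```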
I changed the ROC plot to use a version of the Vega spec. We can play around with the Vega configuration to make the plot look a little better (color scheme, size, etc.). Maybe we should separate out implementing the new CV pipelines into a different PR, #96?
Looks good. Would be great to remove dependencies on temp files if possible.
    with open('jupyter_data/roc_vega_spec.json', 'r') as fp:
        vega_spec = json.load(fp)

    vega.Vega(vega_spec)
I was hoping you could directly pass the dataframe to vega.Vega via the data argument. Looking at the source code, I'm not sure whether data does anything, but it's worth a try. Did you try?

I see above "TODO: do not save intermediate files?"
Let's discuss tonight. The data argument only takes in one dataframe, and the way I have set up the Vega spec takes two dataframes (the full FPR + TPR points and an AUROC summary). To use the data argument directly, I think we would need to calculate the AUROC in Vega, and I'm not sure that is feasible.
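One possible way around the single-dataframe data argument, sketched here with hypothetical dataset and column names: a Vega spec can inline multiple named datasets as values, so both dataframes travel inside the spec itself with no temp file:

```python
import json

import pandas as pd

# Hypothetical stand-ins for the two dataframes the spec needs
roc_df = pd.DataFrame({
    'model': ['covariates'] * 3,
    'fpr': [0.0, 0.2, 1.0],
    'tpr': [0.0, 0.8, 1.0],
})
auroc_df = pd.DataFrame({'model': ['covariates'], 'auroc': [0.8]})

# A Vega spec may carry any number of named datasets inline, so the AUROC
# can still be computed in Python and embedded alongside the curve points
vega_spec = {
    'data': [
        {'name': 'roc', 'values': roc_df.to_dict(orient='records')},
        {'name': 'auroc', 'values': auroc_df.to_dict(orient='records')},
    ],
    # scales, axes, and marks omitted for brevity
}

spec_json = json.dumps(vega_spec)
```

The resulting dict could then be handed to vega.Vega directly instead of round-tripping through jupyter_data/roc_vega_spec.json.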
Sounds good. Want to give a small overview on your progress with this PR thus far at the start of the meetup?
Sure. The MVP of cognoma is going to be providing a Jupyter notebook, correct?
As discussed in cognoma/cognoma#63, yeah. We may still also provide a webapp view, but I think the notebook should be the MVP priority for viewing results.
Agreed!
Ok, let's split out the remaining parts into two new pull requests:
@patrick-miller, are you waiting on me / is this ready to merge on your end?
The covariates model part is good to go, unless you have any other issues. What still needs to be updated are the visualization and the PCA in the pipeline, both of which are a bit tangential.
Following from the great work done by @joshlevy89 in #67, I have created a notebook that takes the covariates and runs three models in the same vein as the main directory notebooks:
This notebook strays a little in that it uses PCA instead of SelectKBest, but I think that is where we are moving anyway. Each of the models is fit on the same partitions, and metrics are calculated for each. In most places I store the results for every model, but in a few spots where we show results I only access one model. I can change it either way; let me know what you prefer, @dhimmel.
One little "cheat" that we are performing here is running PCA on the entire expression matrix instead of just on the train partition. In a separate pull request, I think we should migrate the PCA to being performed inside the pipeline. However, that is not straightforward.
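A sketch of what migrating PCA into the pipeline could look like (scikit-learn assumed; synthetic data and a logistic-regression stand-in, not the notebook's actual models):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix and mutation labels
X, y = make_classification(n_samples=300, n_features=100, random_state=0)

# With PCA as a pipeline step, cross_val_score refits it on each training
# fold, so the held-out fold never leaks into the fitted components
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=20, random_state=0)),
    ('clf', LogisticRegression(max_iter=1000)),
])

auroc_per_fold = cross_val_score(pipeline, X, y, cv=3, scoring='roc_auc')
```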