What covariates should we include as features? #21
Comments
@stephenshank began work at the last meetup on creating a |
See how well covariates (non-expression features) predict TP53 mutation. Related to cognoma#8: General mutation-load does provide some ability to predict mutation status of TP53. Partially addresses cognoma#21: Covariates are extracted from samples.tsv.
My first attempt at this can be found at #46, where I simply try to process the samples data to begin using these as features. Immediate issues are...
As a longer-term issue, I was eager to do some actual machine learning, but my naive attempt at using the existing classifier failed. So it would be great to get some discussion going regarding what tweaks we expect to enhance performance. Now that I've read through more issues, I realize I should've tried this on some of the Hippo pathway genes. But I am worried that munging the features together as I have is not an effective approach. Of course, I would also welcome discussion on anything I may have missed... @dhimmel, @gwaygenomics any thoughts?
Creates an explore directory and README for this type of exploratory notebook. See how well covariates (non-expression features) predict TP53 mutation. Related to cognoma#8: General mutation-load does provide some ability to predict mutation status of TP53. Partially addresses cognoma#21: Covariates are extracted from samples.tsv.
As per my review comment on #46: yes, although imputation is definitely an option here.
I know of 3 options: impute, filter observations, or remove the variable. Since we don't want to start hemorrhaging samples, I don't think we should filter many observations. So maybe we can assess imputation versus removal on a per-variable basis: if a variable can be imputed sensibly, impute and keep it; otherwise remove it.
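The per-variable rule above could be sketched roughly as follows with pandas. This is only an illustration: the `impute_or_drop` helper, the `max_missing` cutoff of 0.3, the median/mode fill choices, and the toy column names are all assumptions, not something decided in this thread or taken from samples.tsv.

```python
import pandas as pd


def impute_or_drop(df, max_missing=0.3):
    """Impute columns with few missing values; drop the rest.

    max_missing is a hypothetical cutoff, not a value from this thread.
    """
    kept = {}
    for name, column in df.items():
        missing_fraction = column.isna().mean()
        if missing_fraction > max_missing:
            continue  # too sparse to impute credibly: remove the variable
        if missing_fraction == 0:
            kept[name] = column
        elif pd.api.types.is_numeric_dtype(column):
            # Numeric variable: fill gaps with the column median (an assumption).
            kept[name] = column.fillna(column.median())
        else:
            # Categorical variable: fill gaps with the most common value.
            kept[name] = column.fillna(column.mode().iloc[0])
    return pd.DataFrame(kept, index=df.index)


# Toy stand-in for the samples table; column names and values are made up.
samples = pd.DataFrame({
    "age_diagnosed": [61.0, None, 47.0, 70.0],
    "gender": ["female", "male", None, "female"],
    "mostly_missing": [None, None, None, 1.0],
})
covariates = impute_or_drop(samples)
```

No observations are filtered here, so the sample count is unchanged; only sparse variables are dropped.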
Personally, I make all_lowercase_underscore_separated variable names. However, I do see a benefit in not messing with Xena names unless we store a reversible mapping. In other words, if using foreign data, it's sometimes better to use dirty column names than break interoperability. However, lots of these variables have already been changed or recoded in cancer-data, so interoperability is less of a worry for those variables.
I'm tempted to do this, just because I consider this part of clean data for developers. It can get frustrating trying to autocomplete variables that you know are there, only to remember that THOSE variables are upper case. I hope to have some progress on imputing the numeric variables for the next PR. The only strategies I know for this are either 1) filling in the most common values among similar cases, or 2) exploring correlations. For instance, we should be able to impute the missing |
@dhimmel I made the proposed changes and started to do some exploratory visualization for imputation, which I've pushed for review. Also note that the notebook and the script are in two separate commits, since I forgot to convert before I pushed 😬. Please let me know if there are any more revisions; I am happy to continue working on this. Also, any suggestions for how to carry out the imputation are welcome... I've made some of my own in the notebook.
@stephenshank, let's keep discussion related to PR #46 on the actual pull request. I didn't see your latest two comments until after my most recent review ): Let's start a new issue for covariate imputation and deal with it in a future pull request.
* Evaluate performance of covariates on TP53
  Creates an explore directory and README for this type of exploratory notebook. See how well covariates (non-expression features) predict TP53 mutation. Related to #8: General mutation-load does provide some ability to predict mutation status of TP53. Partially addresses #21: Covariates are extracted from samples.tsv.
* Evaluate more covariate/mutation combinations
  Evaluate covariate-only classifiers for the interesting mutations compiled in cognoma/cancer-data#22 (comment). Switches to an expand grid system for evaluating all possible covariate combinations. Plot performance of all covariates on each mutation. Switches to `covariates.tsv` created in cognoma/cancer-data#24 for encoded covariates.
* Export clean notebook to script
* Address review comments
In addition to gene expression, we probably should include other information on samples. This discussion will focus on identifying potential covariates and evaluating whether they make sense to include in models. If we don't include the right covariates, confounding is likely to be an issue.
See #8 as a potential example of confounding that may be addressable by adding a mutation load feature.
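As a rough illustration of what a mutation load feature might look like, the sketch below counts mutated genes per sample from a binary sample-by-gene matrix. The `mutation_load` helper, the toy matrix, the choice to exclude the target gene, and the log transform are all assumptions made for this example, not the approach settled on in #8.

```python
import numpy as np
import pandas as pd


def mutation_load(mutations, exclude=()):
    """Count mutated genes per sample, optionally skipping some genes."""
    skipped = set(exclude)
    genes = [gene for gene in mutations.columns if gene not in skipped]
    return mutations[genes].sum(axis=1)


# Toy sample-by-gene matrix (1 = mutated); sample IDs and values are made up.
mutations = pd.DataFrame(
    {"TP53": [1, 0, 1], "KRAS": [1, 0, 0], "NF2": [1, 0, 1]},
    index=["sample_a", "sample_b", "sample_c"],
)
# Leave out the gene being predicted so the covariate does not leak the label.
load = mutation_load(mutations, exclude=["TP53"])
# Mutation counts tend to be right-skewed, so log-transform before modeling.
covariate = np.log1p(load).rename("log1p_mutation_load")
```

The resulting series could then be joined to the other covariates by sample ID.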