What covariates should we include as features? #21

dhimmel · 2016-07-28T20:53:54Z

In addition to gene expression, we probably should include other information on samples. This discussion will focus on identifying potential covariates and evaluating whether they make sense to include in models. If we don't include the right covariates, confounding is likely to be an issue.

See #8 as a potential example of confounding that may be addressable by adding a mutation load feature.

dhimmel · 2016-09-12T15:17:35Z

@stephenshank began work at the last meetup on creating a covariates.tsv from samples.tsv. @stephenshank any updates?

See how well covariates (non-expression features) predict TP53 mutation. Related to cognoma#8: General mutation-load does provide some ability to predict mutation status of TP53. Partially addresses cognoma#21: Covariates are extracted from samples.tsv.

stephenshank · 2016-09-15T13:45:01Z

My first attempt at this can be found at #46, where I simply try to process the samples data to begin using these as features. Immediate issues are...

* Are we comfortable with the way categorical NaNs are handled?
* How do we want to treat numeric NaNs? The current pipeline breaks when data contains NaNs.
* Standardizing column names... perhaps all as lowercase/underscore, which seems to be consistent?

As a longer term issue, I was eager to do some actual machine learning, but my naive attempt at using the existing classifier failed. So it would be great to get some discussion going, regarding what tweaks we expect to enhance performance. Now reading through more issues, I realize I should've tried this on some of the Hippo pathway genes. But I am worried that munging the features together as I have is not an effective approach.

Of course I would also welcome discussion on anything I may have missed... @dhimmel, @gwaygenomics any thoughts?

dhimmel · 2016-09-15T15:39:48Z

> Are we comfortable with the way categorical NaNs are handled?

As per my review comment on #46: yes, although imputation is definitely an option here.
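The thread does not spell out how #46 encodes missing categories, so the following is only a sketch of the two choices mentioned here, using pandas. The `gender` column and its values are hypothetical. Option 1 relies on the default behavior of `pd.get_dummies`, which leaves a NaN row with zeros in every indicator column; option 2 imputes the most frequent category before encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical covariate with one missing value
samples = pd.DataFrame({"gender": ["female", np.nan, "male", "female"]})

# Option 1: keep the NaN as "no category" (all indicator columns are 0)
as_zeros = pd.get_dummies(samples, columns=["gender"], dtype=int)

# Option 2: impute the most frequent category, then encode
most_common = samples["gender"].mode().iloc[0]
imputed = pd.get_dummies(
    samples.fillna({"gender": most_common}), columns=["gender"], dtype=int
)
```

Passing `dummy_na=True` to `pd.get_dummies` would be a third variant: it adds an explicit indicator column for missing values.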

> How do we want to treat numeric NaNs? The current pipeline breaks when data contains NaNs.

I know of three options: impute, filter observations, or remove the variable. Since we don't want to start hemorrhaging samples, I don't think we should filter many observations. So maybe we can assess imputation versus removal on a per-variable basis: if a variable can be imputed sensibly, impute it and keep it; otherwise remove it.
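A minimal sketch of that per-variable policy, assuming pandas. The function name `handle_numeric_nans`, the 25% missingness threshold, and median imputation are all illustrative assumptions rather than decisions made in this thread, and the column names other than `age_diagnosed` are made up.

```python
import numpy as np
import pandas as pd

def handle_numeric_nans(df, max_missing=0.25):
    """Impute numeric columns with few NaNs; drop columns with too many.

    max_missing is a hypothetical cutoff on the fraction of missing values.
    """
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        frac_missing = out[col].isna().mean()
        if frac_missing == 0:
            continue  # nothing to do for complete columns
        if frac_missing <= max_missing:
            # Few gaps: fill with the column median and keep the variable
            out[col] = out[col].fillna(out[col].median())
        else:
            # Too sparse to impute credibly: remove the variable
            out = out.drop(columns=col)
    return out

covariates = pd.DataFrame({
    "age_diagnosed": [61.0, np.nan, 47.0, 70.0, 55.0],
    "n_mutations": [120.0, 35.0, 80.0, 410.0, 66.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
})
cleaned = handle_numeric_nans(covariates)
```

No rows are dropped under this policy, which matches the goal of not losing samples.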

> Standardizing column names... perhaps all as lowercase/underscore, which seems to be consistent?

Personally, I make all_lowercase_underscore_separated variable names. However, I do see a benefit in not messing with Xena names unless we store a reversible mapping. In other words, when using foreign data, it's sometimes better to keep dirty column names than to break interoperability. That said, lots of these variables have already been renamed or recoded in cancer-data, so interoperability is less of a worry for those variables.
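One way a reversible mapping could look, as a sketch using only the standard library. The helper names and the example column names are hypothetical and are not taken from Xena or cancer-data.

```python
import re

def to_snake_case(name):
    """Lowercase a name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

def build_rename_maps(columns):
    """Return (original -> clean, clean -> original) dictionaries."""
    forward = {col: to_snake_case(col) for col in columns}
    # The mapping is only reversible if no two names collapse together
    if len(set(forward.values())) != len(forward):
        raise ValueError("two original names map to the same clean name")
    reverse = {clean: orig for orig, clean in forward.items()}
    return forward, reverse

# Hypothetical upstream column names
original_columns = ["sampleID", "Age At Diagnosis", "gender"]
forward, reverse = build_rename_maps(original_columns)
```

With a pandas DataFrame, `df.rename(columns=forward)` would apply the clean names and `df.rename(columns=reverse)` would restore the upstream ones, provided the mapping is stored alongside the data.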

stephenshank · 2016-09-16T01:11:43Z

> Personally, I make all_lowercase_underscore_separated variable names

I'm tempted to do this, just because I consider this part of clean data for developers. It can get frustrating trying to autocomplete variables that you know are there, only to remember that THOSE variables are upper case.

I hope to have some progress on imputing the numeric variables for the next PR. The only strategies I know for this are either 1) filling in the most common values among similar cases, or 2) exploring correlations. For instance, we should be able to impute the missing age_diagnosed with the correlation that you found with the number of mutations, perhaps through a linear regression. Any other strategies are welcome.
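The regression idea could be sketched as follows, assuming pandas and NumPy. The column name `n_mutations`, the helper `impute_by_regression`, and the toy values are illustrative only, and the sketch assumes the relationship is roughly linear, which would need checking on the real data.

```python
import numpy as np
import pandas as pd

def impute_by_regression(df, target, predictor):
    """Fill NaNs in `target` from a least-squares line fit on `predictor`."""
    known = df[[target, predictor]].dropna()
    # polyfit with deg=1 returns the coefficients as [slope, intercept]
    slope, intercept = np.polyfit(known[predictor], known[target], deg=1)
    # Only rows that lack the target but have the predictor can be filled
    missing = df[target].isna() & df[predictor].notna()
    out = df.copy()
    out.loc[missing, target] = intercept + slope * out.loc[missing, predictor]
    return out

# Toy data in which age is exactly linear in mutation count
samples = pd.DataFrame({
    "n_mutations": [10.0, 20.0, 30.0, 40.0, 50.0],
    "age_diagnosed": [40.0, 45.0, np.nan, 55.0, 60.0],
})
imputed = impute_by_regression(samples, "age_diagnosed", "n_mutations")
```

The function returns a copy, so the original frame keeps its NaNs and the two can be compared when judging whether the imputation is reasonable.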

stephenshank · 2016-09-16T13:13:58Z

@dhimmel I made the proposed changes and started to do some exploratory visualization for imputation, which I've pushed for review. Also note that the notebook and the script are in two separate commits, since I forgot to convert before I pushed 😬. Please let me know if there are any more revisions; I am happy to continue working on this.

Also, any suggestions for how to carry out the imputation are welcome... I've made some of my own in the notebook.

dhimmel · 2016-09-16T14:21:06Z

@stephenshank, let's keep discussion related to PR #46 on the actual pull request. I didn't see your latest two comments until after my most recent review ):

Let's start a new issue for covariate imputation and deal with it in a future pull request.

* Evaluate performance of covariates on TP53

  Creates an explore directory and README for this type of exploratory notebook. See how well covariates (non-expression features) predict TP53 mutation. Related to #8: General mutation-load does provide some ability to predict mutation status of TP53. Partially addresses #21: Covariates are extracted from samples.tsv.

* Evaluate more covariate/mutation combinations

  Evaluate covariate-only classifiers for the interesting mutations compiled in cognoma/cancer-data#22 (comment). Switches to an expand grid system for evaluating all possible covariate combinations. Plot performance of all covariates on each mutation. Switches to `covariates.tsv` created in cognoma/cancer-data#24 for encoded covariates.

* Export clean notebook to script
* Address review comments
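For readers unfamiliar with the term, an expand grid enumerates every combination of the supplied options. The sketch below is a guess at the general idea using only the standard library, not the code from the pull request; the `expand_grid` helper, the second mutation, and the covariate names are hypothetical.

```python
import itertools

def expand_grid(options):
    """Return one dict per element of the Cartesian product of the options."""
    keys = list(options)
    value_lists = [options[key] for key in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]

# Hypothetical covariates; every non-empty subset becomes one feature set
covariates = ["age_diagnosed", "n_mutations"]
covariate_sets = [
    subset
    for size in range(1, len(covariates) + 1)
    for subset in itertools.combinations(covariates, size)
]

grid = expand_grid({
    "mutation": ["TP53", "KRAS"],  # KRAS is only an illustrative second gene
    "covariate_set": covariate_sets,
})
```

Each entry of `grid` names one mutation and one covariate subset, so a covariate-only classifier could be fit and scored once per entry.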

dhimmel mentioned this issue Jul 28, 2016

Process the clinical matrix to extract sample attributes cognoma/cancer-data#10

Closed

dhimmel added the task label Aug 1, 2016

dhimmel mentioned this issue Sep 12, 2016

Decisions required to reach a minimum viable product #44

Open

stephenshank mentioned this issue Sep 15, 2016

First attempt at processing covariate information. #46

Closed

dhimmel mentioned this issue Sep 15, 2016

Evaluate performance of covariates at predicting various mutations #47

Merged

dhimmel mentioned this issue Oct 26, 2016

TP53 mutation prediction from metadata #66

Closed
