What covariates should we include as features? #21

Open
dhimmel opened this issue Jul 28, 2016 · 6 comments

@dhimmel
Member

dhimmel commented Jul 28, 2016

In addition to gene expression, we probably should include other information on samples. This discussion will focus on identifying potential covariates and evaluating whether they make sense to include in models. If we don't include the right covariates, confounding is likely to be an issue.

See #8 as a potential example of confounding that may be addressable by adding a mutation load feature.
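
As a rough illustration of what a mutation load feature could look like, here is a minimal sketch. It assumes a binary sample-by-gene mutation matrix (1 = mutated); the function name `mutation_load`, the column name `n_mutations`, and the toy values are hypothetical and not taken from the project.

```python
import pandas as pd


def mutation_load(mutation_matrix: pd.DataFrame) -> pd.Series:
    """Count mutated genes per sample from a binary sample-by-gene matrix."""
    return mutation_matrix.sum(axis="columns").rename("n_mutations")


# Toy data (made up): rows are samples, columns are genes, 1 = mutated.
toy_mutations = pd.DataFrame(
    {"TP53": [1, 0, 1], "KRAS": [0, 0, 1], "NF1": [1, 0, 1]},
    index=["sample_a", "sample_b", "sample_c"],
)

load = mutation_load(toy_mutations)
print(load)
```

The resulting series could then be joined to the other covariates by sample ID so a classifier can adjust for overall mutation burden.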

@dhimmel
Member Author

dhimmel commented Sep 12, 2016

@stephenshank began work at the last meetup on creating a covariates.tsv from samples.tsv. @stephenshank any updates?

dhimmel added a commit to dhimmel/machine-learning that referenced this issue Sep 14, 2016
See how well covariates (non-expression features) predict TP53 mutation.

Related to cognoma#8:
General mutation-load does provide some ability to predict mutation status of
TP53.

Partially addresses cognoma#21:
Covariates are extracted from samples.tsv.
@stephenshank
Member

stephenshank commented Sep 15, 2016

My first attempt at this can be found at #46, where I simply try to process the samples data to begin using these as features. Immediate issues are...

  1. Are we comfortable with the way categorical NaN's are handled?
  2. How do we want to treat numeric NaN's? The current pipeline breaks when data contains NaN's.
  3. Standardizing column names... perhaps all as lowercase/underscore, which seems to be consistent?

As a longer term issue, I was eager to do some actual machine learning, but my naive attempt at using the existing classifier failed. So it would be great to get some discussion going, regarding what tweaks we expect to enhance performance. Now reading through more issues, I realize I should've tried this on some of the Hippo pathway genes. But I am worried that munging the features together as I have is not an effective approach.

Of course I would also welcome discussion on anything I may have missed... @dhimmel, @gwaygenomics any thoughts?

dhimmel added a commit to dhimmel/machine-learning that referenced this issue Sep 15, 2016
See how well covariates (non-expression features) predict TP53 mutation.

Related to cognoma#8:
General mutation-load does provide some ability to predict mutation status of
TP53.

Partially addresses cognoma#21:
Covariates are extracted from samples.tsv.
dhimmel added a commit to dhimmel/machine-learning that referenced this issue Sep 15, 2016
Creates an explore directory and README for this type of exploratory notebook.

See how well covariates (non-expression features) predict TP53 mutation.

Related to cognoma#8:
General mutation-load does provide some ability to predict mutation status of
TP53.

Partially addresses cognoma#21:
Covariates are extracted from samples.tsv.
@dhimmel
Member Author

dhimmel commented Sep 15, 2016

> Are we comfortable with the way categorical NaN's are handled?

As per my review comment on #46: yes, although imputation is definitely an option here.

> How do we want to treat numeric NaN's? The current pipeline breaks when data contains NaN's.

I know of three options: impute, filter observations, or remove the variable. Since we don't want to start hemorrhaging samples, I don't think we should filter many observations. So maybe we can assess imputation versus removal on a per-variable basis: if a variable can be imputed reasonably, impute it and keep it; otherwise remove it.
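
One way to picture the per-variable decision is the sketch below. It is only an illustration: the 25% missingness threshold, the choice of median imputation, the function name `impute_or_drop`, and the toy columns `days_survived` and `n_mutations` are all assumptions, not something settled in this thread.

```python
import numpy as np
import pandas as pd


def impute_or_drop(df: pd.DataFrame, max_missing: float = 0.25) -> pd.DataFrame:
    """For each numeric column with NaNs: median-impute if few values are
    missing, otherwise drop the column. No rows (samples) are removed."""
    out = df.copy()
    for col in df.select_dtypes(include="number").columns:
        frac_missing = df[col].isna().mean()
        if frac_missing == 0:
            continue  # nothing to do for complete columns
        if frac_missing <= max_missing:
            out[col] = df[col].fillna(df[col].median())
        else:
            out = out.drop(columns=col)
    return out


# Toy data (made up) to show both branches.
toy = pd.DataFrame({
    "age_diagnosed": [50.0, 60.0, np.nan, 70.0, 80.0],        # 20% missing: imputed
    "days_survived": [np.nan, np.nan, np.nan, 100.0, 200.0],  # 60% missing: dropped
    "n_mutations": [10, 20, 30, 40, 50],                      # complete: untouched
})

clean = impute_or_drop(toy)
print(clean)
```

Because only columns are imputed or dropped, every sample is retained, which matches the goal of not filtering observations.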

> Standardizing column names... perhaps all as lowercase/underscore, which seems to be consistent?

Personally, I make all_lowercase_underscore_separated variable names. However, I do see a benefit in not messing with Xena names unless we store a reversible mapping. In other words, if using foreign data, it's sometimes better to use dirty column names than break interoperability. However, lots of these variables have already been changed or recoded in cancer-data, so interoperability is less of a worry for those variables.
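
A reversible mapping could be as simple as the sketch below. The helper names `to_snake_case` and `build_name_mappings` and the example column names are made up for illustration; they are not the actual Xena column names.

```python
import re


def to_snake_case(name: str) -> str:
    """Lowercase a name and collapse runs of non-alphanumerics into '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


def build_name_mappings(columns):
    """Return (original -> clean, clean -> original) dictionaries.

    Raises if two original names clean to the same string, since the
    mapping would then no longer be reversible.
    """
    forward = {col: to_snake_case(col) for col in columns}
    if len(set(forward.values())) != len(forward):
        raise ValueError("cleaned column names collide; mapping is not reversible")
    reverse = {clean: original for original, clean in forward.items()}
    return forward, reverse


# Hypothetical upstream column names.
original_columns = ["sampleID", "Age At Diagnosis", "_primary_disease"]
forward, reverse = build_name_mappings(original_columns)
print(forward)
```

With pandas, `df.rename(columns=forward)` would apply the clean names and `df.rename(columns=reverse)` would restore the originals, provided the `reverse` dictionary is stored alongside the data.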

@stephenshank
Member

> Personally, I make all_lowercase_underscore_separated variable names

I'm tempted to do this, just because I consider this part of clean data for developers. It can get frustrating trying to autocomplete variables that you know are there, only to remember that THOSE variables are upper case.

I hope to have some progress on imputing the numeric variables for the next PR. The only strategies I know for this are either 1) filling in the most common values among similar cases, or 2) exploring correlations. For instance, we should be able to impute the missing age_diagnosed with the correlation that you found with the number of mutations, perhaps through a linear regression. Any other strategies are welcome.
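
To make the regression idea concrete, here is a minimal sketch that fits a straight line on the samples where both values are present and uses it to fill the gaps. The column name `n_mutations`, the function name `impute_by_regression`, and the toy numbers are assumptions; a real attempt would also need to check how strong the relationship actually is.

```python
import numpy as np
import pandas as pd


def impute_by_regression(df: pd.DataFrame, target: str, predictor: str) -> pd.DataFrame:
    """Fill missing `target` values using a least-squares line fit on `predictor`."""
    observed = df[target].notna() & df[predictor].notna()
    slope, intercept = np.polyfit(
        df.loc[observed, predictor].to_numpy(dtype=float),
        df.loc[observed, target].to_numpy(dtype=float),
        deg=1,
    )
    # Only rows with a missing target and a known predictor can be filled.
    missing = df[target].isna() & df[predictor].notna()
    out = df.copy()
    out.loc[missing, target] = slope * df.loc[missing, predictor] + intercept
    return out


# Toy data (made up): the observed points lie on age = 0.5 * n_mutations + 40.
toy = pd.DataFrame({
    "n_mutations": [10.0, 20.0, 30.0, 40.0],
    "age_diagnosed": [45.0, 50.0, np.nan, 60.0],
})

filled = impute_by_regression(toy, "age_diagnosed", "n_mutations")
print(filled)
```

Rows where the predictor is also missing are left as NaN, so a fallback such as a median fill would still be needed for those.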

@stephenshank
Member

stephenshank commented Sep 16, 2016

@dhimmel I made the proposed changes and started to do some exploratory visualization for imputation, which I've pushed for review. Also note that the notebook and the script are in two separate commits, since I forgot to convert before I pushed 😬. Please let me know if there are any more revisions, I am happy to continue working on this.

Also, any suggestions for how to carry out the imputation are welcome... I've made some of my own in the notebook.

@dhimmel
Member Author

dhimmel commented Sep 16, 2016

@stephenshank, let's keep discussion related to PR #46 on the actual pull request. I didn't see your latest two comments until after my most recent review ):

Let's start a new issue for covariate imputation and deal with it in a future pull request.

dhimmel added a commit that referenced this issue Sep 22, 2016
* Evaluate performance of covariates on TP53

Creates an explore directory and README for this type of exploratory notebook.

See how well covariates (non-expression features) predict TP53 mutation.

Related to #8:
General mutation-load does provide some ability to predict mutation status of
TP53.

Partially addresses #21:
Covariates are extracted from samples.tsv.

* Evaluate more covariate/mutation combinations

Evaluate covariate-only classifiers for the interesting mutations compiled in
cognoma/cancer-data#22 (comment).

Switches to an expand grid system for evaluating all possible covariate
combinations.

Plot performance of all covariates on each mutation.

Switches to `covariates.tsv` created in
cognoma/cancer-data#24 for encoded covariates.

* Export clean notebook to script

* Address review comments
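
For readers unfamiliar with the "expand grid" idea mentioned in the commit message above, the sketch below shows the general pattern: enumerate every combination of factor levels, then evaluate one model per combination. The function name `expand_grid`, the factor names, and the example levels are hypothetical; this is not the notebook's actual code.

```python
import itertools


def expand_grid(**factors):
    """Return one dict per combination of the supplied factor levels."""
    names = list(factors)
    return [
        dict(zip(names, combo))
        for combo in itertools.product(*factors.values())
    ]


# Hypothetical factors: which mutation to predict and which covariates to use.
grid = expand_grid(
    mutation=["TP53", "KRAS"],
    covariate_set=["disease", "n_mutations", "all"],
)

for row in grid:
    print(row)
```

Each resulting dict could then parameterize a single covariate-only classifier fit, with the scores collected into one table for plotting.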