First attempt at processing covariate information. #46

Closed
wants to merge 3 commits

Conversation

stephenshank
Member

All feedback is welcome. I'm also going to get some discussion going in #21.

Member

@dhimmel dhimmel left a comment

@stephenshank awesome pull request. I requested some changes. Just to be clear about the desired scope of this pull request: the goal should be creating a covariates.tsv from samples.tsv (and nothing more).

See my exploratory notebook #46, which encodes some variables as dummies. The processing in #46 isn't intended to become part of the main repository (notebooks in the root directory), but it may help you see ways to use get_dummies. It would also be nice to add n_mutations_log10 as in #46. This will be an important covariate.
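A minimal sketch of that covariate (assuming samples.tsv carries a raw 'n_mutations' count column; the actual column name may differ):

import numpy as np

# log10 of (1 + count) keeps zero-mutation samples finite
samples_df['n_mutations_log10'] = np.log10(samples_df['n_mutations'] + 1)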


# In[2]:

samples_df = pd.read_table('download/samples.tsv', index_col=0)
Member

Use os.path.join to construct the filename for Windows compatibility.
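For example:

import os

path = os.path.join('download', 'samples.tsv')  # portable path separator
samples_df = pd.read_table(path, index_col=0)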

Member Author

Done.


categorical_variables = samples_df.columns[[1, 2, 3, 4, 6, 8]]
for variable in categorical_variables:
    not_null = ~samples_df[variable].isnull()
Member

Pandas has a notnull function.
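i.e.:

not_null = samples_df[variable].notnull()  # equivalent to ~isnull(), reads more directly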

Member Author

Very good to know, thanks!


# In[4]:

categorical_variables = samples_df.columns[[1, 2, 3, 4, 6, 8]]
Member

Best to avoid referring to columns by index, since indexes are unstable across versions and aren't explicit in their meaning. List column names instead?
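For example (these names are illustrative, pieced together from ones mentioned elsewhere in this thread; check samples.tsv for the real ones):

categorical_variables = ['disease', 'organ_of_origin', 'gender', 'dead', 'recurred', 'sample_type']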

Member Author

Done.

covariates_df.to_csv(path, sep='\t')


# Now we'll try to see whether incorporating this information will help to improve the elastic net classifier that was explained at a past meetup.
Member

Let's save the remainder of this notebook for another notebook/pull request. You'll be able to really clean up your imports with this gone as well.

Member Author

Okay, I suppose I was a bit too eager for this, haha.

samples_df.gender,
mortality,
recurred]
columns_as_dummies = [pd.get_dummies(column) for column in columns_to_process]
Member

get_dummies can take the whole dataframe at once... No need to do this column by column.

Member Author

This ties into your comment above about using a big dictionary to rename all the columns. When you use get_dummies on the whole data frame, it prepends the name of the column to each categorical variable. I.e., the new column names for the recurred column will be recurred_has_recurred and recurred_has_not_recurred.

Are we okay with this? Without also formatting the disease/organ names, it starts to look a little strange. For instance, organ_of_origin_Thyroid Gland as a column name. Typing out the big dictionary is doable but slightly tedious... perhaps there is room to be clever with Python or regexes.

It's definitely minor, but any advice on how to proceed for the next PR would be appreciated.
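For instance, get_dummies accepts prefix and prefix_sep arguments, so something like the following might skip the dictionary entirely (an untested sketch, and dropping prefixes risks name collisions across variables):

pd.get_dummies(samples_df, columns=categorical_variables, prefix='', prefix_sep='')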

Member

My latest comment suggests you rename recurred_0 to has_not_recurred. As far as organ_of_origin_Thyroid goes, I'd leave it; it's ugly, but it makes it easy to get back to the Xena organ encoding.

Member

But if you think you have a better solution, suggest.

plt.legend(loc='lower right');


# Unfortunately, the classifier does not even seem to notice the presence of the covariates. But this could be due to the MAD filtering step, lack of feature normalization, and plenty of other things. Curious to hear thoughts about where to go.
Member

My guess is that the MAD selection is removing the covariates. But even if not, covariates can be helpful without improving overall performance: they can still prevent confounding (faulty feature importance measurements).
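To see why the MAD filter could drop them (a toy illustration, not the notebook's actual code):

import pandas as pd

# A binary dummy whose minority class is under 50% has median 0, so its
# median absolute deviation is 0 and a top-k MAD filter will discard it.
X = pd.DataFrame({'gene': [1.2, 3.4, 0.5, 2.2], 'dummy': [0, 0, 0, 1]})
mad = (X - X.median()).abs().median()
print(mad)  # gene 0.85, dummy 0.0
X_kept = X[mad.nlargest(1).index]  # keeps 'gene', drops 'dummy'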

pd.get_dummies(samples_df.dead).head(10)


# Note that null values are encoded as (0,0) pairs... I was wondering if we were okay with this.
Member

I think that's a decent way to handle NaN without having to remove the sample. Obviously not ideal but it works. Another option would be imputation. Maybe a future pull request can work on imputing these variables (especially gender).


# Note that null values are encoded as (0,0) pairs... I was wondering if we were okay with this.
#
# Let's rename the values in each of these so that they more accurately reflect the underlying variable.
Member

@dhimmel dhimmel Sep 15, 2016

How about rename after converted to dummies, using DataFrame.rename?

This way you can rename for all variables at once using one big dictionary.

@dhimmel
Member

dhimmel commented Sep 15, 2016

In my previous review #46 should be #47. Oops!

Member

@dhimmel dhimmel left a comment

Some more comments. Will respond to your get_dummies naming question next.

Also, can you track covariates.tsv in git? It's not too big, right?

@@ -56,7 +48,7 @@

# In[5]:

categorical_variables = categorical_variables[1:]
del categorical_variables[categorical_variables.index('sample_type')]
Member

covariates_df = pd.concat(columns_as_dummies+[samples_df[numeric_columns]], axis=1)
mutation_column = np.log1p(number_of_mutations)
covariates_df = pd.concat(
    [samples_df[numeric_columns], pd.get_dummies(samples_df[categorical_variables])],
Member

I think you can use the columns arg of get_dummies so you don't have to mess with concat. Not sure if maybe you have another motive like column order here.

Member Author

Nevertheless, are you okay with the concat? Everything is properly indexed, so I believe it is correct. If it's not okay, it is not immediately clear to me how to use the columns argument...

Member Author

I suppose I could drop the categorical variables once they are converted to dummies; is this what you had in mind?

Member

Not a problem if you would like to keep it, but I think this replaces 112-113:

covariates_df = pd.get_dummies(samples_df, columns=categorical_variables)

Member Author

But then it does not have the numeric covariates?

Member

But then it does not have the numeric covariates?

My understanding and experience are that it retains non-categorical columns, it just doesn't encode them.
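A quick toy check (hypothetical column names):

df = pd.DataFrame({'age_diagnosed': [40, 55], 'gender': ['male', 'female']})
pd.get_dummies(df, columns=['gender'])
# age_diagnosed passes through untouched; only gender is expanded,
# into gender_female and gender_male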

Member Author

Sorry, this is indeed correct! It is not at all clear from the pandas documentation...


# In[8]:

recurred_map = pd.Series(['has_not_recurred', 'has_recurred'], index=[0.0, 1.0])
Member

Not sure if you understood my previous comment here. The suggestion is to avoid any value mapping here. Then once you create covariates_df you can use

renamer = {
    'dead_0': 'alive',
    'dead_1': 'dead',
    # keep going for everything you want to rename
}
covariates_df.rename(columns=renamer, inplace=True)

As an aside, I think it's more succinct to map using a dictionary rather than creating a series.

Member Author

Oops... yes, it is now clear! Will redo.


# In[15]:

sns.jointplot('n_mutations_log1p', 'age_diagnosed', data=covariates_df, kind='reg')
Member

I think this plot is a duplicate of cell 9. Maybe remove this one but keep the disease-specific plots?

@stephenshank
Member Author

stephenshank commented Sep 16, 2016

covariates.tsv is < 3 MB, so we should be able to track it. If there is any strong preference for where it should reside, let me know; otherwise I will put it at the top level in the next PR.

Edit: On second thought, I will create a covariates directory with this notebook and put the tsv file there.

# In[14]:

path = os.path.join('download', 'covariates.tsv')
covariates_df.to_csv(path, sep='\t')
Member

Consider float_format='%.5g' to suppress useless decimal places. It should reduce the file size too.
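i.e.:

covariates_df.to_csv(path, sep='\t', float_format='%.5g')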

@dhimmel
Member

dhimmel commented Sep 16, 2016

I would suggest perhaps making this notebook 1 in this repo and writing covariates.tsv to a data directory.

However, I'm starting to think that this work belongs in cancer-data? @gwaygenomics and @stephenshank, what do you think about adding it to the notebook pipeline in cancer-data?

@yl565
Contributor

yl565 commented Sep 16, 2016

I don't know if it applies here, but I often replace missing values with predictions from other variables.

@yl565
Contributor

yl565 commented Sep 16, 2016

Is this information supposed to be used for mutation classification? If so, perhaps it is better to handle the NaNs at the classification stage, since each algorithm may have different preferences?

@stephenshank
Member Author

@dhimmel I was actually thinking the same thing. I would be happy to move this over to cancer-data as the latest notebook, with covariates.tsv being placed in data so that it is tracked. Your suggestion above reduced file size to ~ 1.2 MB. Does this sound okay?

@dhimmel
Member

dhimmel commented Sep 16, 2016

Is this information supposed to be used for mutation classification? If so, perhaps it is better to handle the NaNs at the classification stage, since each algorithm may have different preferences?

@yl565, are there any sklearn models that handle missing values? What do you mean by "handle the NaNs at the classification stage"?

file size to ~ 1.2 MB. Does this sound okay?

stephenshank, sounds great.

I would be happy to move this over to cancer-data as the latest notebook, with covariates.tsv being placed in data so that it is tracked.

Okay let's move this to cancer-data. You should open a new PR there that references this pull request.

@yl565
Contributor

yl565 commented Sep 16, 2016

@dhimmel sklearn.preprocessing.Imputer handles missing values and can be included in the classification pipeline. I mean that people doing mutation classification may prefer to process the missing values themselves rather than work with already pre-processed data (at least I would...). There are different ways to deal with missing data, and they could interact with classification algorithms differently.
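A minimal sketch of that idea (assuming the 2016-era sklearn API; Imputer was later renamed SimpleImputer, and the classifier here is just illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.linear_model import SGDClassifier

# Imputation lives inside the pipeline, so each user can swap in
# their preferred strategy: 'mean', 'median', or 'most_frequent'.
pipeline = Pipeline([
    ('impute', Imputer(strategy='most_frequent')),
    ('classify', SGDClassifier(loss='log', penalty='elasticnet')),
])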

@yl565
Contributor

yl565 commented Sep 16, 2016

A recommended way to deal with missing values could be set as the default, and it would be nice to have the option to get the raw, unprocessed data.

@dhimmel
Member

dhimmel commented Sep 16, 2016

There are different ways to deal with missing data, and they could interact with classification algorithms differently.

Interesting... how about we perform imputation in cancer-data, resulting in two covariate datasets: covariates.tsv and covariates-imputed.tsv (which would use a single imputation method we decide on). Then machine learning users would be free to select either the imputed covariates or perform imputation themselves.

@yl565
Contributor

yl565 commented Sep 16, 2016

That would be nice!

dhimmel pushed a commit to cognoma/cancer-data that referenced this pull request Sep 17, 2016
* Encodes categorical covariate data from samples.

* Includes script file.

* Changes names of covariates file, removes exploratory visualizations.

* Adds script for covariates.

Originally these changes were submitted to machine-learning, but cancer-data was determined to be a more appropriate home. See cognoma/machine-learning#46
@gwaybio
Member

gwaybio commented Sep 19, 2016

Really nice analysis! I thought the age vs. mutation burden by tissue was very interesting... However, I'm not sure it would be safe to impute age this way.

In general, my thoughts on imputation are as follows:

  1. We should definitely be sure to separate imputed covariates from non-imputed
    • I believe that our target audience (cancer biologists) will not view imputed variables kindly
  2. Survival-associated covariates (age, vital status, time to recurrence, survival time) are tough to impute and even tougher to convince people we are imputing correctly
    • Survival analyses are typically performed using these data, and they have well-studied methods for dealing with missing data
  3. Missing data is a common problem for clinical metrics (even more of a problem for electronic health records!) so imputation is more of a research question than a part of our minimum viable product (see Decisions required to reach a minimum viable product #44 )
  4. Imputing gender should be easy enough, let's at least try this one!

That being said, it could still be an interesting exercise and research question to try to impute all missing info.
