First attempt at processing covariate information. #46
Conversation
@stephenshank awesome pull request. I requested some changes. Just to be clear about the desired scope of this pull request: the goal should be creating a covariates.tsv from samples.tsv (and nothing more).

See my exploratory notebook #46, which encodes some variables as dummies. The processing in #46 isn't intended to become part of the main repository (notebooks in the root directory), but may help you see ways to use get_dummies. It also would be nice to add n_mutations_log10 as in #46. This will be an important covariate.
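As a rough sketch of how such a covariate could be derived (the n_mutations column name here is an assumption for illustration, not necessarily the repo's schema):

```python
import numpy as np
import pandas as pd

# Toy data; 'n_mutations' is an assumed column name.
samples_df = pd.DataFrame({'n_mutations': [0, 9, 99, 999]})

# log10(1 + x) keeps zero-mutation samples finite instead of producing -inf.
samples_df['n_mutations_log10'] = np.log10(1 + samples_df['n_mutations'])
```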
# In[2]:

samples_df = pd.read_table('download/samples.tsv', index_col=0)
Use os.path.join to construct the filename for Windows compatibility.
Done.
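For reference, a minimal sketch of the portable construction:

```python
import os

# os.path.join uses the OS-appropriate separator, so the same code works on Windows.
path = os.path.join('download', 'samples.tsv')
```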
categorical_variables = samples_df.columns[[1, 2, 3, 4, 6, 8]]
for variable in categorical_variables:
    not_null = ~samples_df[variable].isnull()
Pandas has a notnull function.
Very good to know, thanks!
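A quick illustration of the equivalence on toy data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# pd.notnull(s) gives the same mask as ~s.isnull(), without the bitwise negation.
not_null = pd.notnull(s)
```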
# In[4]:

categorical_variables = samples_df.columns[[1, 2, 3, 4, 6, 8]]
Best to avoid referring to columns by index, since indexes are unstable across versions and aren't explicit in their meaning. List column names instead?
Done.
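A sketch of name-based selection; these column names are illustrative, not the actual samples.tsv schema:

```python
import pandas as pd

samples_df = pd.DataFrame({
    'age_diagnosed': [50, 61],
    'disease': ['BRCA', 'LUAD'],
    'gender': ['female', 'male'],
})

# Explicit names keep working if columns are added or reordered upstream.
categorical_variables = ['disease', 'gender']
categorical_df = samples_df[categorical_variables]
```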
covariates_df.to_csv(path, sep='\t')

# Now we'll try to see whether incorporating this information will help to improve the elastic net classifier that was explained at a past meetup.
Let's save the remainder of this notebook for another notebook/pull request. You'll be able to really clean up your imports with this gone as well.
Okay, I suppose I was a bit too eager for this haha.
    samples_df.gender,
    mortality,
    recurred]
columns_as_dummies = [pd.get_dummies(column) for column in columns_to_process]
get_dummies can take the whole dataframe at once... No need to do this column by column.
This ties into your comment above about using a big dictionary to rename all the columns. When you use get_dummies on the whole data frame, it prepends the name of the column to each categorical variable. I.e., the new column names for the recurred column will be recurred_has_recurred and recurred_has_not_recurred.

Are we okay with this? Without also formatting the disease/organ names, it starts to look a little strange; for instance, organ_of_origin_Thyroid Gland as a column name. Typing out the big dictionary is doable but slightly tedious... perhaps there is room to be clever with Python or regexes.

It's definitely minor, but any advice on how to proceed for the next PR would be appreciated.
My latest suggestion is to rename recurred_0 to has_not_recurred. As far as organ_of_origin_Thyroid goes, I'd leave it --- it's ugly but makes it easy to get back to the Xena organ encoding.
But if you think you have a better solution, suggest.
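For concreteness, the prefixing behavior being discussed, on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'recurred': ['has_recurred', 'has_not_recurred']})

# get_dummies on a DataFrame prepends '<column>_' to each category value.
dummies = pd.get_dummies(df)
```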
plt.legend(loc='lower right');

# Unfortunately, the classifier does not even seem to notice the presence of the covariates. But this could be due to the MAD filtering step, lack of feature normalization, and plenty of other things. Curious to hear thoughts about where to go.
My guess is that the MAD selection is removing the covariates. But even if not, covariates can still be helpful even if they don't improve overall performance -- they can still prevent confounding (faulty feature importance measurements).
pd.get_dummies(samples_df.dead).head(10)

# Note that null values are encoded as (0,0) pairs... I was wondering if we were okay with this.
I think that's a decent way to handle NaN without having to remove the sample. Obviously not ideal but it works. Another option would be imputation. Maybe a future pull request can work on imputing these variables (especially gender).
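A small demonstration of the (0, 0) encoding: by default get_dummies leaves NaN rows all zero, while dummy_na=True would add an explicit missing-value indicator column instead.

```python
import numpy as np
import pandas as pd

dead = pd.Series([0.0, 1.0, np.nan])

# The NaN row gets zeros in every indicator column by default.
dummies = pd.get_dummies(dead)
```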
# Note that null values are encoded as (0,0) pairs... I was wondering if we were okay with this.
#
# Let's rename the values in each of these so that they more accurately reflect the underlying variable.
How about renaming after converting to dummies, using DataFrame.rename? This way you can rename for all variables at once using one big dictionary.
Some more comments. Will respond to your get_dummies naming question next.

Also, can you track covariates.tsv in git? It's not too big, right?
@@ -56,7 +48,7 @@

# In[5]:

categorical_variables = categorical_variables[1:]
del categorical_variables[categorical_variables.index('sample_type')]
See DataFrame.drop
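Assuming categorical_variables is a pandas Index, drop avoids the del/index dance (the label names here are illustrative):

```python
import pandas as pd

categorical_variables = pd.Index(['disease', 'sample_type', 'gender'])

# Index.drop returns a new Index without the given label(s).
categorical_variables = categorical_variables.drop('sample_type')
```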
covariates_df = pd.concat(columns_as_dummies + [samples_df[numeric_columns]], axis=1)
mutation_column = np.log1p(number_of_mutations)
covariates_df = pd.concat(
    [samples_df[numeric_columns], pd.get_dummies(samples_df[categorical_variables])],
I think you can use the columns arg of get_dummies so you don't have to mess with concat. Not sure if maybe you have another motive like column order here.
Nevertheless, are you okay with the concat? Everything is properly indexed, so I believe it is correct. If it's not okay, it is not immediately clear to me how to use the columns argument...
I suppose I could drop the categorical variables once they are converted to dummies, is this what you had in mind?
Not a problem if you would like to keep it, but I think this replaces 112-113:
covariates_df = pd.get_dummies(samples_df, columns=categorical_variables)
But then it does not have the numeric covariates?
> But then it does not have the numeric covariates?

My understanding and experience is that it retains non-categorical columns, just doesn't encode them.
Sorry, this is indeed correct! It is not at all clear from the pandas documentation...
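A toy check of the behavior described above: get_dummies with columns= encodes only the named columns and passes the rest through.

```python
import pandas as pd

df = pd.DataFrame({'age': [50, 61], 'gender': ['female', 'male']})

# Only 'gender' is expanded; the numeric 'age' column is retained unchanged.
encoded = pd.get_dummies(df, columns=['gender'])
```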
# In[8]:

recurred_map = pd.Series(['has_not_recurred', 'has_recurred'], index=[0.0, 1.0])
Not sure if you understood my previous comment here. The suggestion is to avoid any value mapping here. Then once you create covariates_df you can use:

renamer = {
    'dead_0': 'alive',
    'dead_1': 'dead',
    # keep going for everything you want to rename
}
covariates_df.rename(columns=renamer, inplace=True)

As an aside, I think it's more succinct to map using a dictionary rather than creating a series.
Oops... yes it is now clear! Will redo.
# In[15]:

sns.jointplot('n_mutations_log1p', 'age_diagnosed', data=covariates_df, kind='reg')
I think this plot is a duplicate of cell 9 here. Maybe remove this plot but keep the disease-specific plots?

Edit: On second thought, I will create a covariates directory with this notebook and put the tsv file there.
# In[14]:

path = os.path.join('download', 'covariates.tsv')
covariates_df.to_csv(path, sep='\t')
Consider float_format='%.5g' to suppress useless decimal points. It should also reduce the file size.
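A sketch of the effect, writing to an in-memory buffer rather than a file:

```python
import io
import pandas as pd

df = pd.DataFrame({'x': [0.123456789, 1.0]})

buf = io.StringIO()
# '%.5g' keeps 5 significant digits and drops trailing zeros (1.0 -> '1').
df.to_csv(buf, sep='\t', float_format='%.5g')
```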
I would suggest perhaps making this notebook 1 in this repo and writing

However, I'm starting to think that this work belongs in
Didn't know if it applies here, but I often replace missing values with the predictions from other variables.
Is this information supposed to be used for mutation classification? If this is the case, perhaps it is better to handle the NaNs at the classification stage, since each algorithm may have different preferences.
@dhimmel I was actually thinking the same thing. I would be happy to move this over to
@yl565, are there any sklearn models that handle missing values? What do you mean by "handle the NaNs at the classification stage"?
@stephenshank, sounds great.
Okay, let's move this to

@dhimmel

A recommended way to deal with missing values could be set as the default, and it would be nice to have the option to get the raw, unpreprocessed data.

Interesting... how about we perform imputation in

That would be nice!
* Encodes categorical covariate data from samples.
* Includes script file.
* Changes names of covariates file, removes exploratory visualizations.
* Adds script for covariates.

Originally these changes were submitted to machine-learning, but cancer-data was determined to be a more appropriate home. See cognoma/machine-learning#46
Really nice analysis! I thought the age vs. mutation burden by tissue was very interesting... However, I'm not sure it would be safe to impute age this way. In general, my thoughts on imputation are as follows:

That being said, it could still be an interesting exercise and research question to try to impute all missing info.
All feedback is welcome. Also going to get some discussion going in #21.