First attempt at processing covariate information. #46

Closed
wants to merge 3 commits

Conversation

stephenshank
Member

All feedback is welcome. I'm also going to get some discussion going in #21.

Member

@dhimmel dhimmel left a comment

@stephenshank awesome pull request. I requested some changes. Just to be clear about the desired scope of this pull request: the goal should be creating a covariates.tsv from samples.tsv (and nothing more).

See my exploratory notebook #46, which encodes some variables as dummies. The processing in #46 isn't intended to become part of the main repository (notebooks in the root directory), but it may help you see ways to use get_dummies. It would also be nice to add n_mutations_log10 as in #46. This will be an important covariate.
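A minimal sketch of that covariate (assuming samples.tsv carries a raw 'n_mutations' count column; the actual column name may differ):

import numpy as np

# log10 of (1 + count) keeps zero-mutation samples finite
samples_df['n_mutations_log10'] = np.log10(samples_df['n_mutations'] + 1)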


# In[2]:

samples_df = pd.read_table('download/samples.tsv', index_col=0)
Member

Use os.path.join to construct the filename for Windows compatibility.
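For example:

import os

path = os.path.join('download', 'samples.tsv')  # portable path separator
samples_df = pd.read_table(path, index_col=0)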

Member Author

Done.


categorical_variables = samples_df.columns[[1, 2, 3, 4, 6, 8]]
for variable in categorical_variables:
    not_null = ~samples_df[variable].isnull()
Member

Pandas has a notnull function.
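i.e.:

not_null = samples_df[variable].notnull()  # equivalent to ~isnull(), reads more directly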

Member Author

Very good to know, thanks!


# In[4]:

categorical_variables = samples_df.columns[[1, 2, 3, 4, 6, 8]]
Member

Best to avoid referring to columns by index, since indexes are unstable across versions and aren't explicit in their meaning. List column names instead?
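For example (these names are illustrative, pieced together from ones mentioned elsewhere in this thread; check samples.tsv for the real ones):

categorical_variables = ['disease', 'organ_of_origin', 'gender', 'dead', 'recurred', 'sample_type']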

Member Author

Done.

covariates_df.to_csv(path, sep='\t')


# Now we'll try to see whether incorporating this information will help to improve the elastic net classifier that was explained at a past meetup.
Member

Let's save the remainder of this notebook for another notebook/pull request. You'll be able to really clean up your imports with this gone as well.

Member Author

Okay, I suppose I was a bit too eager for this, haha.

samples_df.gender,
mortality,
recurred]
columns_as_dummies = [pd.get_dummies(column) for column in columns_to_process]
Member

get_dummies can take the whole dataframe at once... No need to do this column by column.

Member Author

This ties into your comment above about using a big dictionary to rename all the columns. When you use get_dummies on the whole data frame, it prepends the name of the column to each categorical variable. I.e., the new column names for the recurred column will be recurred_has_recurred and recurred_has_not_recurred.

Are we okay with this? Without also formatting the disease/organ names, it starts to look a little strange. For instance, organ_of_origin_Thyroid Gland as a column name. Typing out the big dictionary is doable but slightly tedious... perhaps there is room to be clever with Python or regexes.

It's definitely minor, but any advice on how to proceed for the next PR would be appreciated.
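For instance, get_dummies accepts prefix and prefix_sep arguments, so something like the following might skip the dictionary entirely (an untested sketch, and dropping prefixes risks name collisions across variables):

pd.get_dummies(samples_df, columns=categorical_variables, prefix='', prefix_sep='')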

Member

My latest comment suggests you rename recurred_0 to has_not_recurred. As far as organ_of_origin_Thyroid goes, I'd leave it; it's ugly, but it makes it easy to get back to the Xena organ encoding.

Member

But if you think you have a better solution, suggest.

plt.legend(loc='lower right');


# Unfortunately, the classifier does not even seem to notice the presence of the covariates. But this could be due to the MAD filtering step, lack of feature normalization, and plenty of other things. Curious to hear thoughts about where to go.
Member

My guess is that the MAD selection is removing the covariates. But even if not, covariates can be helpful without improving overall performance: they can still prevent confounding (faulty feature importance measurements).
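To see why the MAD filter could drop them (a toy illustration, not the notebook's actual code):

import pandas as pd

# A binary dummy whose minority class is under 50% has median 0, so its
# median absolute deviation is 0 and a top-k MAD filter will discard it.
X = pd.DataFrame({'gene': [1.2, 3.4, 0.5, 2.2], 'dummy': [0, 0, 0, 1]})
mad = (X - X.median()).abs().median()
print(mad)  # gene 0.85, dummy 0.0
X_kept = X[mad.nlargest(1).index]  # keeps 'gene', drops 'dummy'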

pd.get_dummies(samples_df.dead).head(10)


# Note that null values are encoded as (0,0) pairs... I was wondering if we were okay with this.
Member

I think that's a decent way to handle NaN without having to remove the sample. Obviously not ideal but it works. Another option would be imputation. Maybe a future pull request can work on imputing these variables (especially gender).


# Note that null values are encoded as (0,0) pairs... I was wondering if we were okay with this.
#
# Let's rename the values in each of these so that they more accurately reflect the underlying variable.
Member

@dhimmel dhimmel Sep 15, 2016

How about rename after converted to dummies, using DataFrame.rename?

This way you can rename for all variables at once using one big dictionary.

@dhimmel
Member

dhimmel commented Sep 15, 2016

In my previous review #46 should be #47. Oops!

Member

@dhimmel dhimmel left a comment

Some more comments. Will respond to your get_dummies naming question next.

Also, can you track covariates.tsv in git? It's not too big, right?

@@ -56,7 +48,7 @@

# In[5]:

categorical_variables = categorical_variables[1:]
del categorical_variables[categorical_variables.index('sample_type')]
Member

covariates_df = pd.concat(columns_as_dummies+[samples_df[numeric_columns]], axis=1)
mutation_column = np.log1p(number_of_mutations)
covariates_df = pd.concat(
    [samples_df[numeric_columns], pd.get_dummies(samples_df[categorical_variables])],
Member

I think you can use the columns arg of get_dummies so you don't have to mess with concat. Not sure if maybe you have another motive like column order here.

Member Author

Nevertheless, are you okay with the concat? Everything is properly indexed, so I believe it is correct. If it's not okay, it is not immediately clear to me how to use the columns argument...

Member Author

I suppose I could drop the categorical variables once they are converted to dummies; is this what you had in mind?

Member

Not a problem if you would like to keep it, but I think this replaces 112-113:

covariates_df = pd.get_dummies(samples_df, columns=categorical_variables)

Member Author

But then it does not have the numeric covariates?

Member

But then it does not have the numeric covariates?

My understanding and experience are that it retains non-categorical columns, it just doesn't encode them.
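A quick toy check (hypothetical column names):

df = pd.DataFrame({'age_diagnosed': [40, 55], 'gender': ['male', 'female']})
pd.get_dummies(df, columns=['gender'])
# age_diagnosed passes through untouched; only gender is expanded,
# into gender_female and gender_male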

Member Author

Sorry, this is indeed correct! It is not at all clear from the pandas documentation...


# In[8]:

recurred_map = pd.Series(['has_not_recurred', 'has_recurred'], index=[0.0, 1.0])
Member

Not sure if you understood my previous comment here. The suggestion is to avoid any value mapping here. Then once you create covariates_df you can use

renamer = {
    'dead_0': 'alive',
    'dead_1': 'dead',
    # keep going for everything you want to rename
}
covariates_df.rename(columns=renamer, inplace=True)

As an aside, I think it's more succinct to map using a dictionary rather than creating a series.

Member Author

Oops... yes, it is now clear! Will redo.


# In[15]:

sns.jointplot('n_mutations_log1p', 'age_diagnosed', data=covariates_df, kind='reg')
Member

I think this plot is a duplicate of cell 9. Maybe remove this one but keep the disease-specific plots?

@stephenshank
Member Author

stephenshank commented Sep 16, 2016

covariates.tsv is < 3 MB, so we should be able to track it. If there is any strong preference for where it should reside, let me know; otherwise I will put it at the top level in the next PR.

Edit: On second thought, I will create a covariates directory with this notebook and put the tsv file there.

# In[14]:

path = os.path.join('download', 'covariates.tsv')
covariates_df.to_csv(path, sep='\t')
Member

Consider float_format='%.5g' to suppress useless decimal places. It should reduce the file size too.
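i.e.:

covariates_df.to_csv(path, sep='\t', float_format='%.5g')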

@dhimmel
Member

dhimmel commented Sep 16, 2016

I would suggest perhaps making this notebook 1 in this repo and writing covariates.tsv to a data directory.

However, I'm starting to think that this work belongs in cancer-data? @gwaygenomics and @stephenshank, what do you think about adding it to the notebook pipeline in cancer-data?

@yl565
Contributor

yl565 commented Sep 16, 2016

I don't know if it applies here, but I often replace missing values with predictions from other variables.

@yl565
Contributor

yl565 commented Sep 16, 2016

Is this information supposed to be used for mutation classification? If so, perhaps it is better to handle the NaNs at the classification stage, since each algorithm may have different preferences?

@stephenshank
Member Author

@dhimmel I was actually thinking the same thing. I would be happy to move this over to cancer-data as the latest notebook, with covariates.tsv being placed in data so that it is tracked. Your suggestion above reduced file size to ~ 1.2 MB. Does this sound okay?

@dhimmel
Member

dhimmel commented Sep 16, 2016

Is this information supposed to be used for mutation classification? If so, perhaps it is better to handle the NaNs at the classification stage, since each algorithm may have different preferences?

@yl565, are there any sklearn models that handle missing values? What do you mean by "handle the NaNs at the classification stage"?

file size to ~ 1.2 MB. Does this sound okay?

stephenshank, sounds great.

I would be happy to move this over to cancer-data as the latest notebook, with covariates.tsv being placed in data so that it is tracked.

Okay let's move this to cancer-data. You should open a new PR there that references this pull request.

@yl565
Contributor

yl565 commented Sep 16, 2016

@dhimmel sklearn.preprocessing.Imputer handles missing values and can be included in the classification pipeline. I mean that people doing mutation classification may prefer to process the missing values themselves rather than work with already pre-processed data (at least I would...). There are different ways to deal with missing data, and they could interact with classification algorithms differently.
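A minimal sketch of that idea (assuming the 2016-era sklearn API; Imputer was later renamed SimpleImputer, and the classifier here is just illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.linear_model import SGDClassifier

# Imputation lives inside the pipeline, so each user can swap in
# their preferred strategy: 'mean', 'median', or 'most_frequent'.
pipeline = Pipeline([
    ('impute', Imputer(strategy='most_frequent')),
    ('classify', SGDClassifier(loss='log', penalty='elasticnet')),
])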

@yl565
Contributor

yl565 commented Sep 16, 2016

A recommended way to deal with missing values could be set as the default, and it would be nice to have the option to get the raw, unprocessed data.

@dhimmel
Member

dhimmel commented Sep 16, 2016

There are different ways to deal with missing data, and they could interact with classification algorithms differently.

Interesting... how about we perform imputation in cancer-data, resulting in two covariate datasets: covariates.tsv and covariates-imputed.tsv (which would use a single imputation method we decide on). Then machine learning users would be free to select either the imputed covariates or perform imputation themselves.

@yl565
Contributor

yl565 commented Sep 16, 2016

That would be nice!

dhimmel pushed a commit to cognoma/cancer-data that referenced this pull request Sep 17, 2016
* Encodes categorical covariate data from samples.

* Includes script file.

* Changes names of covariates file, removes exploratory visualizations.

* Adds script for covariates.

Originally these changes were submitted to machine-learning, but cancer-data was determined to be a more appropriate home. See cognoma/machine-learning#46
@gwaybio
Member

gwaybio commented Sep 19, 2016

Really nice analysis! I thought the age vs. mutation burden by tissue was very interesting... However, I'm not sure it would be safe to impute age this way.

In general, my thoughts on imputation are as follows:

  1. We should definitely be sure to separate imputed covariates from non-imputed
    • I believe that our target audience (cancer biologists) will not view imputed variables kindly
  2. Survival-associated covariates (age, vital status, time to recurrence, survival time) are tough to impute and even tougher to convince people we are imputing correctly
    • Survival analyses are typically performed using these data, and they have well-studied methods for dealing with missing data
  3. Missing data is a common problem for clinical metrics (even more of a problem for electronic health records!) so imputation is more of a research question than a part of our minimum viable product (see Decisions required to reach a minimum viable product #44 )
  4. Imputing gender should be easy enough, let's at least try this one!

That being said, it could still be an interesting exercise and research question to try to impute all missing info.
