Create the cognoml package to implement an MVP API #51
Conversation
From the `cognoml` directory, ran:

```
python analysis.py > ../data/api/hippo-output.json
```
Also filter zero-variance features.
results['model'] = utils.model_info(clf_grid.best_estimator_)
feature_df = utils.get_feature_df(clf_grid.best_estimator_, X.columns)
@yl565 currently I'm retrieving feature names using `X.columns`, which will get the feature names of `X` before it enters the pipeline. However, since `VarianceThreshold` or other feature selection/transformation steps will alter the feature set, do you know how we can get feature names at the end of the pipeline? In other words, we want the feature names corresponding to `clf_grid.best_estimator_.coef_`. I searched for like an hour and couldn't figure this out.
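A minimal sketch of one way to do this, assuming the selection step is named `'select'` in the pipeline; the step names and toy data are illustrative, not the actual cognoml pipeline. `get_support()` on the fitted selector gives a boolean mask over the original columns, so the surviving names line up with the classifier's `coef_`. If `clf_grid` searches over the whole pipeline, the same applies to `clf_grid.best_estimator_`.

```
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy data with a zero-variance column 'e' that the selector should drop
X = pd.DataFrame(np.random.rand(20, 5), columns=list('abcde'))
X['e'] = 0.0
y = np.array([0, 1] * 10)

pipeline = Pipeline([
    ('select', VarianceThreshold()),
    ('classify', SGDClassifier()),
])
pipeline.fit(X, y)

# Boolean mask over the original feature set, aligned with coef_
mask = pipeline.named_steps['select'].get_support()
selected_names = X.columns[mask]
coefs = pipeline.named_steps['classify'].coef_.ravel()
print(dict(zip(selected_names, coefs)))
```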
Unselected observations (samples in the dataset that were not selected by the user) are now returned. These observations receive predictions but are missing (-1 encoded) for fields such as `testing` and `status`. Sorted model parameters by key.
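A minimal sketch of that -1 encoding using `fillna`; the column names and values are illustrative, not the actual cognoml data frame.

```
import pandas as pd

obs_df = pd.DataFrame({
    'sample_id': ['s1', 's2', 's3'],
    'selected': [True, True, False],
    'status': [1.0, 0.0, None],   # unknown for the unselected sample
    'testing': [0.0, 1.0, None],
})
# Encode missing fields for unselected observations as -1
obs_df[['status', 'testing']] = obs_df[['status', 'testing']].fillna(-1).astype(int)
print(obs_df)
```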
('sample_id', X_whole.index),
('predicted_status', pipeline.predict(X_whole)),
('predicted_score', pipeline.decision_function(X_whole)),
('predicted_prob', pipeline.predict_proba(X_whole)[:, 1]),
@yl565 what's the best way to see if a pipeline supports `predict_proba`? We can upgrade to scikit-learn 0.18 once that's released, if that will make things easier.
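One possible check, sketched here with toy data; this is not necessarily the approach taken in d52c6a2. Testing `hasattr` on the pipeline's final estimator avoids relying on how `Pipeline` itself delegates the method.

```
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X_whole = pd.DataFrame(np.random.rand(30, 4), columns=list('abcd'))
y = np.array([0, 1] * 15)
pipeline = Pipeline([
    ('select', VarianceThreshold()),
    ('classify', LogisticRegression()),
])
pipeline.fit(X_whole, y)

# Check the final step rather than the Pipeline wrapper itself
final_estimator = pipeline.steps[-1][1]
if hasattr(final_estimator, 'predict_proba'):
    predicted_prob = pipeline.predict_proba(X_whole)[:, 1]
else:
    predicted_prob = None  # e.g. SGDClassifier with hinge loss lacks predict_proba
print(predicted_prob)
```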
See d52c6a2 for my solution
This pull request is ready for some review. Suggesting @yl565 @awm33 @cgreene @gwaygenomics as reviewers. The package name is `cognoml`. The package is pip installable, but there are outstanding issues.
To run the example, use: `python cognoml/main.py > data/api/hippo-output.json`
Nice work. I only made a couple minor comments. I can also confirm that the example `data/api/hippo-output.json` works as expected!
performance = collections.OrderedDict()
for part, df in ('training', obs_train_df), ('testing', obs_test_df):
    y_true=df.status
pep8 spacing updates needed
@@ -111,17 +112,17 @@
 # In[9]:
-get_ipython().run_cell_magic('time', '', "path = os.path.join('data', 'expression-matrix.tsv.bz2')\nX = pd.read_table(path, index_col=0)")
+get_ipython().run_cell_magic('time', '', "path = os.path.join('download', 'expression-matrix.tsv.bz2')\nX = pd.read_table(path, index_col=0)")
If someone were to nbconvert the ipynb file, will this script be overwritten?
Shouldn't be, as the upstream changes to `3.TCGA-MLexample_Pathway.ipynb` are part of this pull request.
ah ok, see it now
X = X_whole.loc[obs_df.sample_id, :]
y = obs_df.status

X_train, X_test, y_train, y_test = train_test_split(
Confirming that you're deciding not to stratify based on disease too?
Ah, stratification by disease could also make sense. Currently, sample/covariate info is not part of this pull request. I think it probably should be added before the first release.
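If disease labels were added, one hedged sketch of stratifying on both status and disease is to combine them into a single label column before calling `train_test_split`; the `disease` covariate and the toy data here are assumptions, since covariate info is not yet part of this pull request.

```
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

obs_df = pd.DataFrame({
    'status': np.tile([0, 1], 20),
    'disease': np.repeat(['BRCA', 'LUAD'], 20),
})
X = pd.DataFrame(np.random.rand(40, 3))
y = obs_df.status

# Combine status and disease into a single stratification label
strata = obs_df.status.astype(str) + '_' + obs_df.disease
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=strata)
print(y_test.value_counts())
```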
Also, I was at a talk by Olivier Elemento - he was building models for a different purpose (predicting immunotherapy responders) but was adjusting for mutation burden as a covariate. We may want to consider checking out his stuff and adjusting for burden too.
results['model']['features'] = utils.df_to_datatables(feature_df)

results['observations'] = utils.df_to_datatables(obs_df)
return results
Maybe at the beginning of this script you can describe what `results` should look like? I am having some difficulty interpreting what `results` actually entails and its format.
I generated a JSON schema (`hippo-output-schema.json`) using:
`genson --indent=2 data/api/hippo-output.json > data/api/hippo-output-schema.json`
I can add descriptions for each field here. Do you think that's a good solution?
Maybe just a link to the hippo-output-schema and the genson command would suffice
OK, I think this looks good to me, although you may want to wait on @yl565 for those specific questions.
Fix pipeline according to: scikit-learn/scikit-learn#7536 (comment)
Extract selected feature names according to: scikit-learn/scikit-learn#7536 (comment)
Used `git checkout --theirs` to resolve conflicts
Nice comments and flow - easy to review. I didn't try running this iteration yet (I believe I tested one previously), but I can run it if you're not pressed for time. I only had some minor fixes and general questions.
```
pip install --editable .
```

Make sure the `cognoma-machine-learning` environment is activated first, so `cognoml` is only installed for this environment.
Would it make sense to state this requirement before `pip install`?
""" | ||
Read data. | ||
""" | ||
v_dir = download_files(directory=data_directory, artile_id=3487685, version=version) |
Typo in `article_id`. Also, is there a reason you are not providing `article_id` and `directory` as function arguments?
> also, is there a reason you are not providing article_id and directory as function arguments?

What do you mean?
The `read_data()` function has only a single argument, `version` - you have `download_files` here accepting two variables, `data_directory` and `3487685`, that are blind to the function. If this was the functionality you are intending, I would recommend doing this instead:

def read_data(directory=data_directory, article_id=3487685, version=None):
    """
    Read data.
    """
    v_dir = download_files(directory=directory, article_id=article_id, version=version)
    ...
`data_directory` is a module-scope variable that should be set using `cognoml.analysis.data_directory = 'new_path'`. Using the default definition of `directory=data_directory` will break the above functionality. There is no support for changing `article_id` currently. So while I see your point, I think the changes add more complexity without any additional functionality at the moment.
@gwaygenomics note that: Python’s default arguments are evaluated once when the function is defined, not each time the function is called (like it is in, say, Ruby).
For mutable defaults only?
> For mutable defaults only?

No, for all defaults. Their value is evaluated upon definition. This leads to odd behavior for mutable defaults, where they can be modified with each function call. Immutables don't have this issue, but are still evaluated upon definition.
y = obs_df.status

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y)
again, this is always set to 10%?
Yeah, unless you have another heuristic that you think is better. Eventually we will probably need to smarten up here and potentially refuse to classify problems with fewer than a certain number of positives.
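A hedged sketch of what such a guard could look like; the threshold of 10 positives is an arbitrary illustration, not an agreed-upon cognoml rule.

```
import pandas as pd

def check_enough_positives(y, min_positives=10):
    """Raise an error when there are too few positive samples to classify."""
    n_pos = int((pd.Series(y) == 1).sum())
    if n_pos < min_positives:
        raise ValueError('only {} positives; need at least {}'.format(n_pos, min_positives))

check_enough_positives([1, 0] * 10)  # passes: 10 positives
```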
#'classify__alpha': 10.0 ** np.arange(-3, 2),
#'classify__l1_ratio': [0.0, 0.1, 0.2, 0.5, 0.8, 0.9, 1.0],
'classify__alpha': 10.0 ** np.arange(-1, 1),
'classify__l1_ratio': [0.0, 1.0],
Lasso or Ridge only?
Fixing the pipeline (#54) really slowed things down because the transformation steps are now performed separately for each CV fold. Therefore, I really cut down the grid. In retrospect, this grid is probably too small -- I'll increase it.
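For reference, a hedged sketch of a denser `alpha` grid with `l1_ratio` left at its scikit-learn default of 0.15, along the lines of the later change referencing cognoma#56; the exact values are illustrative.

```
import numpy as np

param_grid = {
    # 9 points from 1e-3 to 10; l1_ratio is not searched and keeps its default
    'classify__alpha': 10.0 ** np.linspace(-3, 1, 9),
}
print(param_grid)
```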
version_to_url = {d['version']: d['url'] for d in response.json()}
return version_to_url


def download_files(directory, artile_id=3487685, version=None):
another typo in article_id
str
    The version-specific DOI corresponding to the downloaded data.
"""
version_to_url = get_article_versions(artile_id)
article_id
def download_files(directory, artile_id=3487685, version=None):
    """
    Download files for a specific figshare article_id and version to the specified directory.
A usage note that it won't really download to the specified directory - it will append the version too.
`json.dump` function with `cls=JSONEncoder`.
"""
obj_str = pd.json.dumps(obj)
print(obj_str)
making sure you want to print this here
Yeah, this main function is currently only for running a single example / test case. Actual users will call the `cognoml.analysis.classify()` function.
@@ -0,0 +1,323 @@
{
I didn't read this file
Does not address "Lasso or Ridge only?"
Do not optimize `l1_ratio`. Instead use the default of 0.15. Search a denser grid for `alpha`. See cognoma#56
I'll remove WIP (work in progress) when this pull request is complete.
See #31 for API design discussion.
See #44 for MVP (minimal viable product) feature inclusion decisions.