Create the cognoml package to implement an MVP API #51
@@ -11,6 +11,7 @@
from sklearn.cross_validation import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

import utils

@@ -85,6 +86,11 @@ def classify(sample_id, mutation_status, **kwargs):
performance[part] = utils.value_map(metrics, round, ndigits=5)
performance['cv'] = {'auroc': round(clf_grid.best_score_, 5)}
results['performance'] = performance

results['model'] = utils.model_info(clf_grid.best_estimator_)

feature_df = utils.get_feature_df(clf_grid.best_estimator_, X.columns)
results['model']['features'] = utils.df_to_datatables(feature_df)

results['observations'] = utils.df_to_datatables(obs_df)
return results
Maybe at the beginning of this script you can describe what

I generated a JSON schema using (
I can add descriptions for each field here. Do you think that's a good solution?

Maybe just a link to the hippo-output-schema and the genson command would suffice.

@@ -107,6 +113,7 @@ def classify(sample_id, mutation_status, **kwargs):
clf_grid = grid_search.GridSearchCV(estimator=clf, param_grid=param_grid, n_jobs=-1, scoring='roc_auc')

pipeline = make_pipeline(
    VarianceThreshold(),
    StandardScaler(),
    clf_grid
)

@yl565 currently I'm retrieving feature names using `X.columns`, which gets the feature names of `X` before it enters the pipeline. However, since `VarianceThreshold` or other feature selection/transformation steps will alter the feature set, do you know how we can get the feature names at the end of the pipeline? In other words, we want the feature names corresponding to `clf_grid.best_estimator_.coef_`. I searched for about an hour and couldn't figure this out.

See scikit-learn/scikit-learn#7536
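
For illustration only, here is a minimal sketch of the idea behind the question above: carry the selector's boolean mask over to the original column names, so the surviving names line up with the coefficients. This is not the approach adopted in this PR or in the linked issue. It uses only the standard library as a stand-in for `VarianceThreshold`; in scikit-learn, a fitted selector exposes the equivalent mask through `get_support()`, and `StandardScaler` leaves the feature set unchanged, so the names pass through it as they are. The helper `selected_feature_names`, the sample column names, and the coefficient values are all hypothetical.

```python
from statistics import pvariance


def selected_feature_names(columns, rows, threshold=0.0):
    """Return the names of columns whose population variance exceeds threshold.

    Hypothetical helper: it mimics variance-based feature selection so that
    the mask can be applied to the column names as well as to the data.
    """
    # Transpose the row-oriented data into one tuple of values per feature.
    per_feature = list(zip(*rows))
    # Boolean mask: True where the feature survives the selection step.
    mask = [pvariance(values) > threshold for values in per_feature]
    # Apply the same mask to the names, preserving the original order.
    return [name for name, keep in zip(columns, mask) if keep]


# Made-up data: the second column is constant, so it is dropped.
columns = ["gene_a", "gene_b", "gene_c"]
rows = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.4],
    [3.0, 5.0, 0.9],
]
kept = selected_feature_names(columns, rows)

# Made-up coefficients, one per surviving feature, standing in for coef_.
coefs = [0.7, -0.3]
named_coefs = dict(zip(kept, coefs))
print(named_coefs)
```

With the real pipeline, the same pairing would presumably start from `X.columns` and the mask of the fitted `VarianceThreshold` step rather than from a hand-computed variance.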