Create the cognoml package to implement an MVP API #51

Merged · 20 commits · Oct 11, 2016

Changes from 1 commit
7 changes: 7 additions & 0 deletions cognoml/analysis.py
@@ -11,6 +11,7 @@
from sklearn.cross_validation import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

import utils
@@ -85,6 +86,11 @@ def classify(sample_id, mutation_status, **kwargs):
performance[part] = utils.value_map(metrics, round, ndigits=5)
performance['cv'] = {'auroc': round(clf_grid.best_score_, 5)}
results['performance'] = performance

results['model'] = utils.model_info(clf_grid.best_estimator_)

feature_df = utils.get_feature_df(clf_grid.best_estimator_, X.columns)
Member Author:

@yl565 Currently I'm retrieving feature names using X.columns, which gets the feature names of X before it enters the pipeline. However, since VarianceThreshold or other feature selection/transformation steps will alter the feature set, do you know how we can get the feature names at the end of the pipeline? In other words, we want the feature names corresponding to clf_grid.best_estimator_.coef_. I searched for about an hour and couldn't figure this out.
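One possible answer (a sketch under assumptions, not code from this PR): the fitted VarianceThreshold step exposes get_support(), a boolean mask over the input columns, which can map X.columns to the features that survive selection and therefore align with coef_. The toy data, column names, and pipeline below are made up for illustration.

```python
# Sketch: recovering post-selection feature names via get_support().
# The DataFrame, column names, and pipeline here are illustrative only.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X = pd.DataFrame({
    'constant': [1.0, 1.0, 1.0, 1.0],   # zero variance; dropped by VarianceThreshold
    'informative': [0.0, 1.0, 0.0, 1.0],
    'noisy': [0.2, 0.8, 0.1, 0.9],
})
y = np.array([0, 1, 0, 1])

pipeline = make_pipeline(VarianceThreshold(), SGDClassifier(random_state=0))
pipeline.fit(X, y)

# Boolean mask of columns retained by the fitted selector step
mask = pipeline.named_steps['variancethreshold'].get_support()
surviving = X.columns[mask]

# coef_ of the final estimator lines up with the surviving columns
coefficients, = pipeline.named_steps['sgdclassifier'].coef_
assert len(surviving) == len(coefficients)
```

If several selection/transformation steps are chained, each selector's mask would need to be applied in sequence to recover the final column set.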

results['model']['features'] = utils.df_to_datatables(feature_df)

results['observations'] = utils.df_to_datatables(obs_df)
return results
Member:

Maybe at the beginning of this script you could describe what results should look like? I'm having some difficulty interpreting what results actually entails and its format.

Member Author (@dhimmel) · Sep 26, 2016:

I generated a JSON schema (hippo-output-schema.json) using:

genson --indent=2 data/api/hippo-output.json > data/api/hippo-output-schema.json

I can add descriptions for each field here. Do you think that's a good solution?

Member:

Maybe just a link to the hippo-output-schema and the genson command would suffice.

@@ -107,6 +113,7 @@ def classify(sample_id, mutation_status, **kwargs):
clf_grid = grid_search.GridSearchCV(estimator=clf, param_grid=param_grid, n_jobs=-1, scoring='roc_auc')

pipeline = make_pipeline(
VarianceThreshold(),
StandardScaler(),
clf_grid
)
15 changes: 15 additions & 0 deletions cognoml/utils.py
@@ -97,3 +97,18 @@ def threshold_metrics(y_true, y_pred):
metrics['auroc'] = sklearn.metrics.roc_auc_score(y_true, y_pred)
metrics['auprc'] = sklearn.metrics.average_precision_score(y_true, y_pred)
return metrics

def model_info(estimator):
model = collections.OrderedDict()
model['class'] = type(estimator).__name__
model['module'] = estimator.__module__
model['parameters'] = estimator.get_params()
return model

def get_feature_df(estimator, features):
coefficients, = estimator.coef_
feature_df = pd.DataFrame.from_items([
('feature', features),
('coefficient', coefficients),
])
return feature_df
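For context (a hedged sketch, not code from this diff): get_feature_df pairs each input feature name with the corresponding entry of a fitted linear estimator's coef_, which sklearn returns with shape (1, n_features) for a binary classifier. The stand-in estimator and gene names below are hypothetical, and the frame is built with a plain constructor rather than the diff's pd.DataFrame.from_items (a 2016-era pandas API).

```python
# Illustration of the get_feature_df idea with a stand-in estimator.
# FakeEstimator and the feature names 'TP53'/'KRAS' are hypothetical.
import pandas as pd

class FakeEstimator:
    # Shape (1, n_features), as for a fitted binary linear model
    coef_ = [[0.5, -0.25]]

def get_feature_df(estimator, features):
    # Unpack the single row of coef_ and pair it with feature names
    coefficients, = estimator.coef_
    return pd.DataFrame({'feature': list(features),
                         'coefficient': list(coefficients)},
                        columns=['feature', 'coefficient'])

feature_df = get_feature_df(FakeEstimator(), ['TP53', 'KRAS'])
```

The resulting two-column frame is what later feeds utils.df_to_datatables for the "features" block of the JSON output.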
91 changes: 91 additions & 0 deletions data/api/hippo-output.json
@@ -31,6 +31,97 @@
"auroc": 0.62524
}
},
"model": {
"class": "SGDClassifier",
"module": "sklearn.linear_model.stochastic_gradient",
"parameters": {
"warm_start": false,
"alpha": 0.1,
"random_state": 0,
"learning_rate": "optimal",
"shuffle": true,
"epsilon": 0.1,
"power_t": 0.5,
"n_iter": 5,
"penalty": "elasticnet",
"class_weight": "balanced",
"loss": "log",
"n_jobs": 1,
"eta0": 0.0,
"fit_intercept": true,
"average": false,
"l1_ratio": 0.0,
"verbose": 0
},
"features": {
"columns": [
"feature",
"coefficient"
],
"data": [
[
"1421",
-0.04357
],
[
"5203",
0.10076
],
[
"5818",
0.09927
],
[
"9875",
0.07751
],
[
"10675",
0.03264
],
[
"10919",
0.02275
],
[
"23262",
-0.02254
],
[
"23467",
-0.21388
],
[
"54941",
0.0073
],
[
"79622",
0.00158
],
[
"147746",
-0.10429
],
[
"255167",
-0.03445
],
[
"284123",
-0.0188
],
[
"646851",
-0.05939
],
[
"728689",
0.00557
]
]
}
},
"observations": {
"columns": [
"sample_id",