
Create the cognoml package to implement an MVP API #51

Merged: 20 commits merged into cognoma:master on Oct 11, 2016

Conversation

@dhimmel (Member) commented on Sep 20, 2016

I'll remove WIP (work in progress) when this pull request is complete.

See #31 for API design discussion.

See #44 for MVP (minimal viable product) feature inclusion decisions.

From the `cognoml` directory, I ran:

```
python analysis.py > ../data/api/hippo-output.json
```
Also filter zero-variance features.

```
results['model'] = utils.model_info(clf_grid.best_estimator_)

feature_df = utils.get_feature_df(clf_grid.best_estimator_, X.columns)
```
Member Author:

@yl565 currently I'm retrieving feature names using X.columns, which gets the feature names of X before it enters the pipeline. However, since VarianceThreshold or other feature selection/transformation steps will alter the feature set, do you know how we can get feature names at the end of the pipeline? In other words, we want the feature names corresponding to clf_grid.best_estimator_.coef_. I searched for about an hour and couldn't figure this out.
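For illustration, a minimal sketch of one possible approach (step names and data below are assumed, not taken from this pull request): boolean-mask the original columns with the selector's get_support(), since that mask indexes the pre-pipeline feature names and lines up with the final estimator's coef_.

```
# Hypothetical sketch: recover surviving feature names from a fitted Pipeline
# whose selection step is named 'select' (names and toy data are assumed).
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import SGDClassifier

X = pd.DataFrame(np.random.rand(20, 5), columns=['f0', 'f1', 'f2', 'f3', 'f4'])
X['f4'] = 0.0  # zero-variance column that VarianceThreshold drops
y = np.array([0, 1] * 10)

pipeline = Pipeline([
    ('select', VarianceThreshold()),
    ('classify', SGDClassifier()),
])
pipeline.fit(X, y)

# get_support() is a boolean mask over the original columns, so the masked
# column index lines up with the final estimator's coef_.
mask = pipeline.named_steps['select'].get_support()
selected_features = X.columns[mask]
coef = pipeline.named_steps['classify'].coef_.ravel()
assert len(selected_features) == len(coef)
```

In the actual code the fitted pipeline would be clf_grid.best_estimator_ rather than a locally fitted one.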

Member Author:

Unselected observations (samples in the dataset that were not selected
by the user) are now returned. These observations receive predictions
but are missing (-1 encoded) for fields such as `testing` and `status`.

Sorted model parameters by key.
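For illustration, a hedged sketch of what the -1 encoding described above could look like (column and variable names assumed; the actual implementation may differ):

```
# Hypothetical sketch: extend obs_df to every sample in the dataset and mark
# fields that only apply to user-selected samples with -1.
import pandas as pd

obs_df = pd.DataFrame({'status': [0, 1], 'testing': [0, 1]},
                      index=['sample-a', 'sample-b'])
all_samples = ['sample-a', 'sample-b', 'sample-c']  # stand-in for X_whole.index

obs_df = obs_df.reindex(all_samples)
for column in ['status', 'testing']:
    # unselected samples get -1 rather than NaN so the column stays integer
    obs_df[column] = obs_df[column].fillna(-1).astype(int)
```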
```
('sample_id', X_whole.index),
('predicted_status', pipeline.predict(X_whole)),
('predicted_score', pipeline.decision_function(X_whole)),
('predicted_prob', pipeline.predict_proba(X_whole)[:, 1]),
```
Member Author:

@yl565 what's the best way to see if a pipeline supports predict_proba? We can upgrade to sklearn 0.18 once that's released, if that will make things easier.
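One version-agnostic option, shown only as a sketch (not necessarily the approach taken in d52c6a2), is to probe at call time:

```
# Hedged sketch: try the call and fall back when the final estimator has no
# predict_proba (e.g. SGDClassifier with loss='hinge').
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

X_whole = np.random.rand(20, 3)
y = np.array([0, 1] * 10)
pipeline = Pipeline([('classify', SGDClassifier(loss='hinge'))])
pipeline.fit(X_whole, y)

try:
    predicted_prob = pipeline.predict_proba(X_whole)[:, 1]
except AttributeError:
    predicted_prob = None  # hinge loss provides no probability estimates
```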

Member Author:

See d52c6a2 for my solution

@dhimmel (Member Author) commented on Sep 26, 2016

This pull request is ready for some review. Suggesting @yl565 @awm33 @cgreene @gwaygenomics.

The package name is cognoml. Feedback on name welcome.

The package is pip installable, but there are outstanding issues:

  • data management: I'm not sure of the best way to have our package access the data. Should you always have to pass a path for the directory that contains the data? If the data doesn't exist in that directory, it can be downloaded.
  • algorithm support: currently only SGDClassifier is supported. It's not clear how we want to modularly support different pipelines. If we want to support different algorithms, how should this info get passed to this module?
  • feature name bug
  • currently the returned JSON may contain numpy values, which break the builtin JSON encoder. Alternatively, cognoml.utils.JSONEncoder should work (a sketch of such an encoder follows this list). However, if writing to the database, maybe we should just sanitize all JSON values before returning.
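For reference, a minimal sketch of what such an encoder could look like (the actual cognoml.utils.JSONEncoder may differ):

```
# Hypothetical sketch of a JSON encoder that converts numpy scalars and arrays
# to builtin Python types before serialization.
import json
import numpy as np

class JSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)

# json.dumps(results, cls=JSONEncoder) would then handle numpy values.
```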

To run the example, use:

```
python cognoml/main.py > data/api/hippo-output.json
```

@gwaybio (Member) left a comment:

Nice work. I only made a couple minor comments. I can also confirm that the example data/api/hippo-output.json works as expected!


```
performance = collections.OrderedDict()
for part, df in ('training', obs_train_df), ('testing', obs_test_df):
    y_true=df.status
```
Member:

pep8 spacing updates needed

```
@@ -111,17 +112,17 @@

# In[9]:

get_ipython().run_cell_magic('time', '', "path = os.path.join('data', 'expression-matrix.tsv.bz2')\nX = pd.read_table(path, index_col=0)")
get_ipython().run_cell_magic('time', '', "path = os.path.join('download', 'expression-matrix.tsv.bz2')\nX = pd.read_table(path, index_col=0)")
```
Member:

if someone were to nbconvert the ipynb file, will this script be overwritten?

Member Author:

It shouldn't be, as the upstream changes to 3.TCGA-MLexample_Pathway.ipynb are part of this pull request.

Member:

Ah OK, I see it now.

```
X = X_whole.loc[obs_df.sample_id, :]
y = obs_df.status

X_train, X_test, y_train, y_test = train_test_split(
```
Member:

Confirming that you're deciding not to stratify based on disease too?

Member Author:

Ah, stratification by disease could also make sense. Currently, sample/covariate info is not part of this pull request. I think it probably should be added before the first release.
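For illustration, a hedged sketch of how stratification by disease as well as status could work once covariate info is available (column names and data are assumed):

```
# Hypothetical sketch: stratify the split on a combined status/disease label.
import pandas as pd
from sklearn.model_selection import train_test_split

obs_df = pd.DataFrame({
    'status': [0, 1] * 8,
    'disease': ['BRCA'] * 8 + ['LUAD'] * 8,
})
strata = obs_df.status.astype(str) + '-' + obs_df.disease
train_df, test_df = train_test_split(
    obs_df, test_size=0.25, random_state=0, stratify=strata)
```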

Member:

Also, I was at a talk by Olivier Elemento; he was building models for a different purpose (predicting immunotherapy responders) but was adjusting for mutation burden as a covariate. We may want to consider checking out his work and adjusting for burden too.

```
results['model']['features'] = utils.df_to_datatables(feature_df)

results['observations'] = utils.df_to_datatables(obs_df)
return results
```
Member:

Maybe at the beginning of this script you could describe what `results` should look like? I am having some difficulty interpreting what `results` actually contains and how it is formatted.

@dhimmel (Member Author), Sep 26, 2016:

I generated a JSON schema (hippo-output-schema.json) using:

```
genson --indent=2 data/api/hippo-output.json > data/api/hippo-output-schema.json
```

I can add descriptions for each field here. Do you think that's a good solution?

Member:

Maybe just a link to the hippo-output-schema and the genson command would suffice

@gwaybio (Member) commented on Sep 27, 2016

OK, I think this looks good to me, although you may want to wait on @yl565 for those specific questions.

@dhimmel (Member Author) commented on Oct 10, 2016

I propose we merge this sooner rather than later. The pull request is already quite large.

@gwaygenomics, @cgreene, @yl565 does one of you want to submit a review for up to 2291a0c?

@gwaybio (Member) left a comment:

Nice comments and flow; easy to review. I didn't try running this iteration yet (I believe I tested one previously), but I can run it if you're not pressed for time.

I only had some minor fixes and general questions.

```
pip install --editable .
```

Make sure the `cognoma-machine-learning` environment is activated first, so that `cognoml` is only installed for this environment.
Member:

Would it make sense to state this requirement before the pip install command?

"""
Read data.
"""
v_dir = download_files(directory=data_directory, artile_id=3487685, version=version)
Member:

typo in article_id

also, is there a reason you are not providing article_id and directory as function arguments?

Member Author:

> also, is there a reason you are not providing article_id and directory as function arguments?

What do you mean?

Member:

The read_data() function has only a single argument, version, yet download_files here receives data_directory and 3487685, values that the caller of read_data has no control over.

If this is the functionality you intend, I would recommend doing this instead:

```
def read_data(directory=data_directory, article_id=3487685, version=None):
    """
    Read data.
    """
    v_dir = download_files(directory=directory, article_id=article_id, version=version)
    ...
```

Member Author:

data_directory is a module scope variable that should be set using:

cognoml.analysis.data_directory = 'new_path'

Using the default definition of directory=data_directory will break the above functionality.

There is no support for changing article_id currently.

So while I see your point, I think the changes add more complexity without any additional functionality at the moment.

Member Author:

@gwaygenomics note that:

Python’s default arguments are evaluated once when the function is defined, not each time the function is called (like it is in say, Ruby).

Member:

For mutable defaults only?

Member Author:

> For mutable defaults only?

No, for all defaults. Their value is evaluated upon definition. This leads to an odd behavior for mutable defaults, where they can be modified with each function call. Immutables don't have this issue, but are still evaluated upon definition.
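A minimal illustration of the point (hypothetical function names, mirroring the data_directory pattern above):

```
# The default is captured when the function is defined, so it does not track
# later reassignment of the module-level variable.
data_directory = 'download'

def read_data_stale(directory=data_directory):
    return directory

def read_data_live(directory=None):
    # Looking the variable up inside the body picks up reassignment.
    return data_directory if directory is None else directory

data_directory = 'new_path'
print(read_data_stale())  # 'download' -- stale default
print(read_data_live())   # 'new_path'
```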

```
y = obs_df.status

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.1, random_state=0, stratify=y)
```
Member:

again, this is always set to 10%?

Member Author:

Yeah, unless you have another heuristic that you think is better. Eventually we will probably need to smarten up here and potentially refuse to classify problems with fewer than a certain number of positives.
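A hedged sketch of the kind of guard this could eventually become (threshold and names assumed):

```
# Hypothetical guard: refuse to fit when the selection has too few positives.
import pandas as pd

MIN_POSITIVES = 10  # threshold assumed for illustration

def check_enough_positives(y):
    if int(y.sum()) < MIN_POSITIVES:
        raise ValueError('Too few positive samples to train a classifier.')

y = pd.Series([1, 0, 0, 1, 0, 1])
check_enough_positives(y)  # raises ValueError for this toy selection
```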

```
#'classify__alpha': 10.0 ** np.arange(-3, 2),
#'classify__l1_ratio': [0.0, 0.1, 0.2, 0.5, 0.8, 0.9, 1.0],
'classify__alpha': 10.0 ** np.arange(-1, 1),
'classify__l1_ratio': [0.0, 1.0],
```
Member:

Lasso or Ridge only?

Member Author:

Fixing the pipeline (#54) really slowed things down because the transformation steps are now performed separately for each CV fold. Therefore, I really cut down the grid. In retrospect, this grid is probably too small -- I'll increase it.

Member Author:

See #56 and 10308e0

```
    version_to_url = {d['version']: d['url'] for d in response.json()}
    return version_to_url


def download_files(directory, artile_id=3487685, version=None):
```
Member:

another typo in article_id

```
str
    The version-specific DOI corresponding to the downloaded data.
"""
version_to_url = get_article_versions(artile_id)
```
Member:

article_id


```
def download_files(directory, artile_id=3487685, version=None):
    """
    Download files for a specific figshare article_id and version to the specified directory.
```
Member:

A usage note: it won't really download to the specified directory; it will append the version too.
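A hedged illustration of the behavior described (path format and names assumed; the actual code may differ):

```
# Downloads land in a version-specific subdirectory rather than directly in
# the directory passed by the caller.
import os

directory = 'download'
version = 4
version_directory = os.path.join(directory, 'v{}'.format(version))
os.makedirs(version_directory, exist_ok=True)
```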

```
`json.dump` function with `cls=JSONEncoder`.
"""
obj_str = pd.json.dumps(obj)
print(obj_str)
```
Member:

making sure you want to print this here

Member Author:

Yeah this main function is currently only for running a single example / test case. Actual users will call the cognoml.analysis.classify() function.

```
@@ -0,0 +1,323 @@
{
```
Member:

I didn't read this file

Does not address "Lasso or Ridge only?"
Do not optimize `l1_ratio`. Instead use the default of 0.15. Search a denser
grid for `alpha`. See cognoma#56
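A hedged sketch of the revised search space (exact values assumed; see cognoma#56 for the actual grid):

```
# l1_ratio stays at SGDClassifier's default of 0.15; only alpha is searched,
# on a denser grid than before.
import numpy as np

param_grid = {
    'classify__alpha': 10.0 ** np.arange(-3, 2, 0.5),
}
```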
@dhimmel dhimmel changed the title [WIP] Create the cognoml package to implement an MVP API Create the cognoml package to implement an MVP API Oct 11, 2016
@dhimmel dhimmel merged commit 2cb9f34 into cognoma:master Oct 11, 2016
@dhimmel dhimmel deleted the package branch October 11, 2016 15:05