
Create the cognoml package to implement an MVP API #51

Merged: 20 commits merged into cognoma:master on Oct 11, 2016

Conversation

@dhimmel (Member) commented on Sep 20, 2016

I'll remove WIP (work in progress) when this pull request is complete.

See #31 for API design discussion.

See #44 for MVP (minimal viable product) feature inclusion decisions.

From the `cognoml` directory, I ran:

```
python analysis.py > ../data/api/hippo-output.json
```
Also filter zero-variance features.

```
results['model'] = utils.model_info(clf_grid.best_estimator_)

feature_df = utils.get_feature_df(clf_grid.best_estimator_, X.columns)
```
Member Author:

@yl565 currently I'm retrieving feature names using X.columns, which gets the feature names of X before it enters the pipeline. However, since VarianceThreshold or other feature selection/transformation steps will alter the feature set, do you know how we can get feature names at the end of the pipeline? In other words, we want the feature names corresponding to clf_grid.best_estimator_.coef_. I searched for about an hour and couldn't figure this out.
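For illustration, a minimal sketch of one possible approach (step names and data below are assumed, not taken from this pull request): boolean-mask the original columns with the selector's get_support(), since that mask indexes the pre-pipeline feature names and lines up with the final estimator's coef_.

```
# Hypothetical sketch: recover surviving feature names from a fitted Pipeline
# whose selection step is named 'select' (names and toy data are assumed).
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import SGDClassifier

X = pd.DataFrame(np.random.rand(20, 5), columns=['f0', 'f1', 'f2', 'f3', 'f4'])
X['f4'] = 0.0  # zero-variance column that VarianceThreshold drops
y = np.array([0, 1] * 10)

pipeline = Pipeline([
    ('select', VarianceThreshold()),
    ('classify', SGDClassifier()),
])
pipeline.fit(X, y)

# get_support() is a boolean mask over the original columns, so the masked
# column index lines up with the final estimator's coef_.
mask = pipeline.named_steps['select'].get_support()
selected_features = X.columns[mask]
coef = pipeline.named_steps['classify'].coef_.ravel()
assert len(selected_features) == len(coef)
```

In the actual code the fitted pipeline would be clf_grid.best_estimator_ rather than a locally fitted one.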

Member Author:

Unselected observations (samples in the dataset that were not selected
by the user) are now returned. These observations receive predictions
but are missing (-1 encoded) for fields such as `testing` and `status`.

Sorted model parameters by key.
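For illustration, a hedged sketch of what the -1 encoding described above could look like (column and variable names assumed; the actual implementation may differ):

```
# Hypothetical sketch: extend obs_df to every sample in the dataset and mark
# fields that only apply to user-selected samples with -1.
import pandas as pd

obs_df = pd.DataFrame({'status': [0, 1], 'testing': [0, 1]},
                      index=['sample-a', 'sample-b'])
all_samples = ['sample-a', 'sample-b', 'sample-c']  # stand-in for X_whole.index

obs_df = obs_df.reindex(all_samples)
for column in ['status', 'testing']:
    # unselected samples get -1 rather than NaN so the column stays integer
    obs_df[column] = obs_df[column].fillna(-1).astype(int)
```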
```
('sample_id', X_whole.index),
('predicted_status', pipeline.predict(X_whole)),
('predicted_score', pipeline.decision_function(X_whole)),
('predicted_prob', pipeline.predict_proba(X_whole)[:, 1]),
```
Member Author:

@yl565 what's the best way to see if a pipeline supports predict_proba? We can upgrade to sklearn 0.18 once that's released, if that will make things easier.
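One version-agnostic option, shown only as a sketch (not necessarily the approach taken in d52c6a2), is to probe at call time:

```
# Hedged sketch: try the call and fall back when the final estimator has no
# predict_proba (e.g. SGDClassifier with loss='hinge').
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

X_whole = np.random.rand(20, 3)
y = np.array([0, 1] * 10)
pipeline = Pipeline([('classify', SGDClassifier(loss='hinge'))])
pipeline.fit(X_whole, y)

try:
    predicted_prob = pipeline.predict_proba(X_whole)[:, 1]
except AttributeError:
    predicted_prob = None  # hinge loss provides no probability estimates
```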

Member Author:

See d52c6a2 for my solution

@dhimmel (Member Author) commented on Sep 26, 2016

This pull request is ready for some review. Suggesting @yl565 @awm33 @cgreene @gwaygenomics.

The package name is cognoml. Feedback on name welcome.

The package is pip installable, but there are outstanding issues:

  • data management: I'm not sure of the best way to have our package access the data. Should you always have to pass a path for the directory that contains the data? If the data doesn't exist in that directory, it can be downloaded.
  • algorithm support: currently only SGDClassifier is supported. It's not clear how we want to modularly support different pipelines. If we want to support different algorithms, how should this info get passed to this module?
  • feature name bug
  • currently the returned JSON may contain numpy values, which break the builtin JSON encoder. Alternatively, cognoml.utils.JSONEncoder should work (a sketch of such an encoder follows this list). However, if writing to the database, maybe we should just sanitize all JSON values before returning.
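For reference, a minimal sketch of what such an encoder could look like (the actual cognoml.utils.JSONEncoder may differ):

```
# Hypothetical sketch of a JSON encoder that converts numpy scalars and arrays
# to builtin Python types before serialization.
import json
import numpy as np

class JSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)

# json.dumps(results, cls=JSONEncoder) would then handle numpy values.
```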

To run the example, use:

```
python cognoml/main.py > data/api/hippo-output.json
```

@gwaybio (Member) left a comment:

Nice work. I only made a couple minor comments. I can also confirm that the example data/api/hippo-output.json works as expected!


```
performance = collections.OrderedDict()
for part, df in ('training', obs_train_df), ('testing', obs_test_df):
    y_true=df.status
```
Member:

pep8 spacing updates needed

```
@@ -111,17 +112,17 @@

# In[9]:

get_ipython().run_cell_magic('time', '', "path = os.path.join('data', 'expression-matrix.tsv.bz2')\nX = pd.read_table(path, index_col=0)")
get_ipython().run_cell_magic('time', '', "path = os.path.join('download', 'expression-matrix.tsv.bz2')\nX = pd.read_table(path, index_col=0)")
```
Member:

if someone were to nbconvert the ipynb file, will this script be overwritten?

Member Author:

It shouldn't be, as the upstream changes to 3.TCGA-MLexample_Pathway.ipynb are part of this pull request.

Member:

Ah OK, I see it now.

```
X = X_whole.loc[obs_df.sample_id, :]
y = obs_df.status

X_train, X_test, y_train, y_test = train_test_split(
```
Member:

Confirming that you're deciding not to stratify based on disease too?

Member Author:

Ah, stratification by disease could also make sense. Currently, sample/covariate info is not part of this pull request. I think it probably should be added before the first release.
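For illustration, a hedged sketch of how stratification by disease as well as status could work once covariate info is available (column names and data are assumed):

```
# Hypothetical sketch: stratify the split on a combined status/disease label.
import pandas as pd
from sklearn.model_selection import train_test_split

obs_df = pd.DataFrame({
    'status': [0, 1] * 8,
    'disease': ['BRCA'] * 8 + ['LUAD'] * 8,
})
strata = obs_df.status.astype(str) + '-' + obs_df.disease
train_df, test_df = train_test_split(
    obs_df, test_size=0.25, random_state=0, stratify=strata)
```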

Member:

Also, I was at a talk by Olivier Elemento; he was building models for a different purpose (predicting immunotherapy responders) but was adjusting for mutation burden as a covariate. We may want to consider checking out his work and adjusting for burden too.

```
results['model']['features'] = utils.df_to_datatables(feature_df)

results['observations'] = utils.df_to_datatables(obs_df)
return results
```
Member:

Maybe at the beginning of this script you could describe what `results` should look like? I am having some difficulty interpreting what `results` actually contains and how it is formatted.

@dhimmel (Member Author), Sep 26, 2016:

I generated a JSON schema (hippo-output-schema.json) using:

```
genson --indent=2 data/api/hippo-output.json > data/api/hippo-output-schema.json
```

I can add descriptions for each field here. Do you think that's a good solution?

Member:

Maybe just a link to the hippo-output-schema and the genson command would suffice

@gwaybio (Member) commented on Sep 27, 2016

OK, I think this looks good to me, although you may want to wait on @yl565 for those specific questions.

@dhimmel (Member Author) commented on Oct 10, 2016

I propose we merge this sooner rather than later. The pull request is already quite large.

@gwaygenomics, @cgreene, @yl565 does one of you want to submit a review for up to 2291a0c?

@gwaybio (Member) left a comment:

Nice comments and flow; easy to review. I didn't try running this iteration yet (I believe I tested one previously), but I can run it if you're not pressed for time.

I only had some minor fixes and general questions.

```
pip install --editable .
```

Make sure the `cognoma-machine-learning` environment is activated first, so that `cognoml` is only installed for this environment.
Member:

Would it make sense to state this requirement before the pip install command?

"""
Read data.
"""
v_dir = download_files(directory=data_directory, artile_id=3487685, version=version)
Member:

typo in article_id

also, is there a reason you are not providing article_id and directory as function arguments?

Member Author:

> also, is there a reason you are not providing article_id and directory as function arguments?

What do you mean?

Member:

The read_data() function has only a single argument, version, yet download_files here receives data_directory and 3487685, values that the caller of read_data has no control over.

If this is the functionality you intend, I would recommend doing this instead:

```
def read_data(directory=data_directory, article_id=3487685, version=None):
    """
    Read data.
    """
    v_dir = download_files(directory=directory, article_id=article_id, version=version)
    ...
```

Member Author:

data_directory is a module scope variable that should be set using:

cognoml.analysis.data_directory = 'new_path'

Using the default definition of directory=data_directory will break the above functionality.

There is no support for changing article_id currently.

So while I see your point, I think the changes add more complexity without any additional functionality at the moment.

Member Author:

@gwaygenomics note that:

Python’s default arguments are evaluated once when the function is defined, not each time the function is called (like it is in say, Ruby).

Member:

For mutable defaults only?

Member Author:

> For mutable defaults only?

No, for all defaults. Their value is evaluated upon definition. This leads to an odd behavior for mutable defaults, where they can be modified with each function call. Immutables don't have this issue, but are still evaluated upon definition.
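A minimal illustration of the point (hypothetical function names, mirroring the data_directory pattern above):

```
# The default is captured when the function is defined, so it does not track
# later reassignment of the module-level variable.
data_directory = 'download'

def read_data_stale(directory=data_directory):
    return directory

def read_data_live(directory=None):
    # Looking the variable up inside the body picks up reassignment.
    return data_directory if directory is None else directory

data_directory = 'new_path'
print(read_data_stale())  # 'download' -- stale default
print(read_data_live())   # 'new_path'
```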

```
y = obs_df.status

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.1, random_state=0, stratify=y)
```
Member:

again, this is always set to 10%?

Member Author:

Yeah, unless you have another heuristic that you think is better. Eventually we will probably need to smarten up here and potentially refuse to classify problems with fewer than a certain number of positives.
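A hedged sketch of the kind of guard this could eventually become (threshold and names assumed):

```
# Hypothetical guard: refuse to fit when the selection has too few positives.
import pandas as pd

MIN_POSITIVES = 10  # threshold assumed for illustration

def check_enough_positives(y):
    if int(y.sum()) < MIN_POSITIVES:
        raise ValueError('Too few positive samples to train a classifier.')

y = pd.Series([1, 0, 0, 1, 0, 1])
check_enough_positives(y)  # raises ValueError for this toy selection
```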

```
#'classify__alpha': 10.0 ** np.arange(-3, 2),
#'classify__l1_ratio': [0.0, 0.1, 0.2, 0.5, 0.8, 0.9, 1.0],
'classify__alpha': 10.0 ** np.arange(-1, 1),
'classify__l1_ratio': [0.0, 1.0],
```
Member:

Lasso or Ridge only?

Member Author:

Fixing the pipeline (#54) really slowed things down because the transformation steps are now performed separately for each CV fold. Therefore, I really cut down the grid. In retrospect, this grid is probably too small -- I'll increase it.

Member Author:

See #56 and 10308e0

```
    version_to_url = {d['version']: d['url'] for d in response.json()}
    return version_to_url


def download_files(directory, artile_id=3487685, version=None):
```
Member:

another typo in article_id

```
str
    The version-specific DOI corresponding to the downloaded data.
"""
version_to_url = get_article_versions(artile_id)
```
Member:

article_id


```
def download_files(directory, artile_id=3487685, version=None):
    """
    Download files for a specific figshare article_id and version to the specified directory.
```
Member:

A usage note: it won't really download to the specified directory; it will append the version too.
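A hedged illustration of the behavior described (path format and names assumed; the actual code may differ):

```
# Downloads land in a version-specific subdirectory rather than directly in
# the directory passed by the caller.
import os

directory = 'download'
version = 4
version_directory = os.path.join(directory, 'v{}'.format(version))
os.makedirs(version_directory, exist_ok=True)
```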

```
`json.dump` function with `cls=JSONEncoder`.
"""
obj_str = pd.json.dumps(obj)
print(obj_str)
```
Member:

making sure you want to print this here

Member Author:

Yeah this main function is currently only for running a single example / test case. Actual users will call the cognoml.analysis.classify() function.

```
@@ -0,0 +1,323 @@
{
```
Member:

I didn't read this file

Does not address "Lasso or Ridge only?"
Do not optimize `l1_ratio`. Instead use the default of 0.15. Search a denser
grid for `alpha`. See cognoma#56
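A hedged sketch of the revised search space (exact values assumed; see cognoma#56 for the actual grid):

```
# l1_ratio stays at SGDClassifier's default of 0.15; only alpha is searched,
# on a denser grid than before.
import numpy as np

param_grid = {
    'classify__alpha': 10.0 ** np.arange(-3, 2, 0.5),
}
```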
@dhimmel dhimmel changed the title [WIP] Create the cognoml package to implement an MVP API Create the cognoml package to implement an MVP API Oct 11, 2016
@dhimmel dhimmel merged commit 2cb9f34 into cognoma:master Oct 11, 2016
@dhimmel dhimmel deleted the package branch October 11, 2016 15:05