
Decisions required to reach a minimum viable product #44

Open
dhimmel opened this issue Sep 12, 2016 · 6 comments


dhimmel commented Sep 12, 2016

We're nearing the point where we'll need to implement a machine learning module to execute user queries. We're looking to create a minimum viable product. We can expand functionality later, but for now let's focus on the simplest and most succinct implementation. There are several decisions to make:

  1. Classifier: which classifiers should we support? If we want to support only a single classifier for now, which one?
  2. Predictions: do we want to return probabilities, scores, or class predictions?
  3. Threshold: do we want to report performance measures that depend on a single classification threshold? Or do we want to report performance measures that span thresholds?
  4. Testing: Do we want to use a testing partition in addition to cross-validation? If so, do we refit a model on all observations?
  5. Features: should we include covariates in addition to expression features (see What covariates should we include as features? #21)?
  6. Feature selection: Do we want to perform any feature selection?
  7. Feature extraction: Do we want to perform feature extraction, such as PCA (see Integrating dimensionality reduction into the pipeline #43)?

So let's work out these choices, with a focus on simplicity.


dhimmel commented Sep 14, 2016

Here are my thoughts:

  1. Classifier: sklearn.linear_model.SGDClassifier with a grid search to find the optimal l1_ratio and alpha. See 2.TCGA-MLexample.ipynb for an example, and the sketch after this list.
  2. Predictions: let's return all three, using the object names probability, score, and class under a predictions key. The frontend should handle cases where probability is absent.
  3. Threshold: Both.
  4. Testing: Let's hold out 10% for testing.
  5. Features: deferring this decision based on the maturity of What covariates should we include as features? #21.
  6. Feature selection: let's do MAD feature selection to 8000 genes based on @yl565's findings in Median absolute deviation feature selection #22 (comment). This should help speed up fitting the elastic net without too much performance loss.
  7. Feature extraction: deferring this decision based on the maturity of Integrating dimensionality reduction into the pipeline #43.
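
To make this concrete, here is a minimal sketch of how items 1, 2, 4, and 6 could fit together. It assumes an expression matrix expression_df (samples × genes) and a binary outcome vector y; the variable names and grid values are illustrative placeholders, not settled decisions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Illustrative stand-ins for the real data
expression_df = pd.DataFrame(np.random.rand(100, 20000))
y = np.random.randint(0, 2, size=100)

# Item 4: hold out 10% of observations for testing
X_train, X_test, y_train, y_test = train_test_split(
    expression_df, y, test_size=0.1, random_state=0, stratify=y)

# Item 6: MAD feature selection to 8000 genes, computed on training data only
mad = (X_train - X_train.median()).abs().median()
top_genes = mad.nlargest(8000).index
X_train, X_test = X_train[top_genes], X_test[top_genes]

# Item 1: elastic net SGDClassifier with a grid search over l1_ratio and alpha
param_grid = {
    'alpha': [0.001, 0.01, 0.1],        # assumed grid values
    'l1_ratio': [0.0, 0.15, 0.5, 1.0],  # assumed grid values
}
# loss='log' enables predict_proba (renamed to 'log_loss' in newer scikit-learn)
clf = SGDClassifier(loss='log', penalty='elasticnet', random_state=0)
grid = GridSearchCV(clf, param_grid, scoring='roc_auc')
grid.fit(X_train, y_train)

# Item 2: probability, score, and class under a single predictions key
best = grid.best_estimator_
predictions = {
    'probability': best.predict_proba(X_test)[:, 1],  # absent for hinge-loss models
    'score': best.decision_function(X_test),
    'class': best.predict(X_test),
}
```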

@gwaygenomics, @yl565, @stephenshank: do you agree?


gwaybio commented Sep 14, 2016

Can you clarify what you mean by number 3?

Or do we want to report performance measures that span thresholds?

Like AUROC?


dhimmel commented Sep 14, 2016

By "span thresholds" I'm referring to any measure computed from predicted probabilities/scores, such as AUROC or AUPRC. By "single classification threshold", I'm referring to any measure computed from predicted classes, such as precision, recall, accuracy, or F1 score.


gwaybio commented Sep 14, 2016

Got it. Then yes, this all looks good to me.


yl565 commented Sep 14, 2016

+1


htcai commented Sep 18, 2016

Sounds good!
