
Decisions required to reach a minimum viable product #44

Open
dhimmel opened this issue Sep 12, 2016 · 6 comments


dhimmel commented Sep 12, 2016

We're nearing the point where we'll need to implement a machine learning module to execute user queries. We're looking to create a minimum viable product. We can expand functionality later, but for now let's focus on the simplest and most succinct implementation. There are several decisions to make:

  1. Classifier: which classifiers should we support? If we want to support only a single classifier for now, which one?
  2. Predictions: do we want to return probabilities, scores, or class predictions?
  3. Threshold: do we want to report performance measures that depend on a single classification threshold? Or do we want to report performance measures that span thresholds?
  4. Testing: Do we want to use a testing partition in addition to cross-validation? If so, do we refit a model on all observations?
  5. Features: should we include covariates in addition to expression features (see What covariates should we include as features? #21)?
  6. Feature selection: Do we want to perform any feature selection?
  7. Feature extraction: Do we want to perform feature extraction, such as PCA (see Integrating dimensionality reduction into the pipeline #43)?

So let's work out these choices, with a focus on simplicity.


dhimmel commented Sep 14, 2016

Here are my thoughts:

  1. Classifier: sklearn.linear_model.SGDClassifier with a grid search to find the optimal l1_ratio and alpha. See 2.TCGA-MLexample.ipynb for an example, and the sketch after this list.
  2. Predictions: let's return all three, using the object names probability, score, and class under a predictions key. The frontend should handle cases where probability is absent.
  3. Threshold: Both.
  4. Testing: Let's hold out 10% for testing.
  5. Features: deferring this decision based on the maturity of What covariates should we include as features? #21.
  6. Feature selection: let's do MAD feature selection to 8000 genes based on @yl565's findings in Median absolute deviation feature selection #22 (comment). This should help speed up fitting the elastic net without too much performance loss.
  7. Feature extraction: deferring this decision based on the maturity of Integrating dimensionality reduction into the pipeline #43.
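
To make this concrete, here is a minimal sketch of how items 1, 2, 4, and 6 could fit together. It assumes an expression matrix expression_df (samples × genes) and a binary outcome vector y; the variable names and grid values are illustrative placeholders, not settled decisions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Illustrative stand-ins for the real data
expression_df = pd.DataFrame(np.random.rand(100, 20000))
y = np.random.randint(0, 2, size=100)

# Item 4: hold out 10% of observations for testing
X_train, X_test, y_train, y_test = train_test_split(
    expression_df, y, test_size=0.1, random_state=0, stratify=y)

# Item 6: MAD feature selection to 8000 genes, computed on training data only
mad = (X_train - X_train.median()).abs().median()
top_genes = mad.nlargest(8000).index
X_train, X_test = X_train[top_genes], X_test[top_genes]

# Item 1: elastic net SGDClassifier with a grid search over l1_ratio and alpha
param_grid = {
    'alpha': [0.001, 0.01, 0.1],        # assumed grid values
    'l1_ratio': [0.0, 0.15, 0.5, 1.0],  # assumed grid values
}
# loss='log' enables predict_proba (renamed to 'log_loss' in newer scikit-learn)
clf = SGDClassifier(loss='log', penalty='elasticnet', random_state=0)
grid = GridSearchCV(clf, param_grid, scoring='roc_auc')
grid.fit(X_train, y_train)

# Item 2: probability, score, and class under a single predictions key
best = grid.best_estimator_
predictions = {
    'probability': best.predict_proba(X_test)[:, 1],  # absent for hinge-loss models
    'score': best.decision_function(X_test),
    'class': best.predict(X_test),
}
```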

@gwaygenomics, @yl565, @stephenshank: do you agree?


gwaybio commented Sep 14, 2016

Can you clarify what you mean by number 3?

Or do we want to report performance measures that span thresholds?

Like AUROC?


dhimmel commented Sep 14, 2016

By "span thresholds" I'm referring to any measure computed from predicted probabilities/scores, such as AUROC or AUPRC. By "single classification threshold", I'm referring to any measure computed from predicted classes, such as precision, recall, accuracy, or F1 score.


gwaybio commented Sep 14, 2016

Got it. Then yes, this all looks good to me.


yl565 commented Sep 14, 2016

+1


htcai commented Sep 18, 2016

Sounds good!
