Custom stopping criteria and loss functions #211

fipelle opened this issue Feb 4, 2023 · 9 comments

fipelle commented Feb 4, 2023

Hi,

I can't seem to find the documentation for creating custom stopping criteria (ideally for ensembles) and loss functions. Could you please point me in the right direction? Thanks!

ablaom (Member) commented Feb 4, 2023

I'm not sure this is provided by this package, but you can get it using the MLJ wrapper:

  • IteratedModel docs
  • MNIST / Flux example of iterative model control

Is this what you're after?
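
For concreteness, here is a minimal sketch of that pattern (the model, data, and control settings below are illustrative and assume the wrapped model exposes an iteration parameter to MLJ):

using MLJ  # IteratedModel, Step, Patience, NumberLimit are re-exported from MLJIteration

Forest = @load RandomForestClassifier pkg=DecisionTree
forest = Forest()

iterated_forest = IteratedModel(
    model=forest,
    resampling=Holdout(fraction_train=0.8),  # out-of-sample loss drives the controls
    measure=log_loss,
    controls=[
        Step(10),         # grow the iteration parameter by 10 per control cycle
        Patience(3),      # stop after 3 consecutive increases in the loss
        NumberLimit(20),  # hard cap on the number of control cycles
    ],
)

X, y = make_blobs(500; centers=5)
mach = machine(iterated_forest, X, y)
fit!(mach)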

ablaom (Member) commented Feb 4, 2023

Mmm... I see warm restart has not been implemented for the wrapper, which will make run time very slow for large numbers of iterations. I've posted JuliaAI/MLJDecisionTreeInterface.jl#40 in response.

fipelle (Author) commented Feb 7, 2023

I am trying to use a random forest classifier with:

  • a custom version of the Gini impurity,
  • an additional stopping criterion based on some function. [this can be done indirectly by customising the loss]

Similar features are implemented in other packages such as LightGBM. See, for instance, the links below:

I was hoping to be able to do something similar with DecisionTree.jl directly.

ablaom (Member) commented Feb 7, 2023

Yes, I understand. I just don't think that functionality exists here. I'll leave the issue open, and perhaps someone will add it.

For my part, I'd rather prioritise model-generic solutions to controlling iterative models, which is what MLJIteration provides. That way we avoid a lot of duplication of effort.

fipelle (Author) commented Feb 10, 2023

@ablaom I think I figured out how to do it using the native APIs.

In the case of classification trees, this is easy enough. All you need to do is something along the lines of build_tree(labels, features, loss=(ns, n)->custom_loss(ns, n, args...)). In the case of the Gini impurity, ns is the vector of cases per class at the node being split and n is the total number of cases at that node. Of course, correct me if I am wrong.
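
To make that concrete, a rough, untested sketch (the helper name my_gini is made up, and the exact meaning of the (ns, n) arguments should be checked against util.entropy):

using DecisionTree

# Hypothetical custom loss with the (ns, n) signature discussed above:
# ns = per-class counts at the node being split, n = total count at that node.
my_gini(ns, n) = 1.0 - sum(abs2, ns ./ n)  # plain Gini impurity

labels   = rand(["a", "b", "c"], 200)  # toy data
features = rand(200, 4)

tree = build_tree(labels, features; loss=my_gini)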

In the case of a random forest, things are a little more complicated. Is there any way around writing a custom build_forest function with a user-defined loss instead of the default util.entropy(ns, n, entropy_terms)? Regarding changes to the bootstrap samples (when needed), they have to be implemented by modifying inds at

inds = rand(_rng, 1:t_samples, n_samples)

and

inds = rand(1:t_samples, n_samples)

I suppose.
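
For what it's worth, a rough sketch of the kind of replacement draw I have in mind (the helper below is hypothetical and would live inside a custom build_forest):

using Random

# Hypothetical drop-in for the uniform inds = rand(rng, 1:t_samples, n_samples)
# draw: sample with replacement within each class, so class proportions are
# (approximately) preserved in every bootstrap sample.
function stratified_bootstrap_inds(rng::AbstractRNG, labels::AbstractVector, n_samples::Integer)
    inds = Int[]
    for class in unique(labels)
        class_inds = findall(==(class), labels)
        n_class = round(Int, n_samples * length(class_inds) / length(labels))
        append!(inds, rand(rng, class_inds, n_class))
    end
    return inds
end

# e.g. inds = stratified_bootstrap_inds(Random.default_rng(), labels, n_samples)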

EDIT

I think it would be nice to extend DecisionTree, or to have a small package with more flexible versions of build_forest, to accommodate custom usage. What do you think? I am happy to create a new package if you'd prefer to keep things separate (a bit like StatsPlots.jl).

ablaom (Member) commented Feb 12, 2023

In the case of a random forest, things are a little more complicated. Is there any way around writing a custom build_forest function with a user-defined loss instead of the default util.entropy(ns, n, entropy_terms)?

Yes, I also recently discovered that the loss parameter is only exposed for single trees, and not forests. I'd definitely support fixing this and will open an issue.

fipelle (Author) commented Feb 12, 2023

@ablaom I have almost finished writing a custom implementation that also allows for custom bootstrapping (e.g., stratified sampling). Do you think it would be best to keep it separate, or would you accept a pull request for it as well?

ablaom (Member) commented Feb 13, 2023

Glad to hear about the progress. I think, to reduce the maintenance burden on this package, I'd prefer not to add model-generic functionality within the package itself. MLJ and other toolboxes already provide things like stratified resampling.

For example:

using MLJ

X, y = make_blobs(centers=5)  # synthetic data with 5 classes

Tree = @load DecisionTreeClassifier pkg=DecisionTree

tree = Tree()

julia> evaluate(
       tree,
       X,
       y,
       resampling=StratifiedCV(nfolds=5),
       measure=LogLoss(),
       )
PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌────────────────────────────────┬───────────┬─────────────┬─────────┬──────────────────────────────┐
│ measure                        │ operation │ measurement │ 1.96*SE │ per_fold                     │
├────────────────────────────────┼───────────┼─────────────┼─────────┼──────────────────────────────┤
│ LogLoss(                       │ predict   │ 5.05        │ 2.3     │ [5.41, 7.21, 1.8, 3.6, 7.21] │
│   tol = 2.220446049250313e-16) │           │             │         │                              │
└────────────────────────────────┴───────────┴─────────────┴─────────┴──────────────────────────────┘

I suggest that if MLJ is missing a feature you need, you open an issue there - and maybe even help provide it. The impact will be greater and the maintenance burden lower.

ablaom (Member) commented Feb 13, 2023

@fipelle When this PR merges, you will be able to (efficiently) control early stopping (and more) through the MLJ interface. A RandomForestClassifier example is given in the PR. Another example is this notebook.
