Custom stopping criteria and loss functions #211

fipelle opened this issue Feb 4, 2023 · 9 comments

fipelle commented Feb 4, 2023

Hi,

I can't seem to find the documentation for creating custom stopping criteria (ideally for ensembles) and loss functions. Could you please point me in the right direction? Thanks!

ablaom (Member) commented Feb 4, 2023

I'm not sure this is provided by this package, but you can get it using the MLJ wrapper:

  • IteratedModel docs
  • MNIST / Flux example of iterative model control

Is this what you're after?
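
For concreteness, here is a minimal sketch of that pattern (the model, data, and control settings below are illustrative and assume the wrapped model exposes an iteration parameter to MLJ):

using MLJ  # IteratedModel, Step, Patience, NumberLimit are re-exported from MLJIteration

Forest = @load RandomForestClassifier pkg=DecisionTree
forest = Forest()

iterated_forest = IteratedModel(
    model=forest,
    resampling=Holdout(fraction_train=0.8),  # out-of-sample loss drives the controls
    measure=log_loss,
    controls=[
        Step(10),         # grow the iteration parameter by 10 per control cycle
        Patience(3),      # stop after 3 consecutive increases in the loss
        NumberLimit(20),  # hard cap on the number of control cycles
    ],
)

X, y = make_blobs(500; centers=5)
mach = machine(iterated_forest, X, y)
fit!(mach)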

ablaom (Member) commented Feb 4, 2023

Mmm... I see warm restart has not been implemented for the wrapper, which will make run time very slow for large numbers of iterations. I've posted JuliaAI/MLJDecisionTreeInterface.jl#40 in response.

fipelle (Author) commented Feb 7, 2023

I am trying to use a random forest classifier with:

  • a custom version of the Gini impurity,
  • an additional stopping criterion based on some function. [this can be done indirectly by customising the loss]

Similar features are implemented in other packages such as LightGBM. See, for instance, the links below:

I was hoping to be able to do something similar with DecisionTree.jl directly.

ablaom (Member) commented Feb 7, 2023

Yes, I understand. I just don't think that functionality exists here. I'll leave the issue open, and perhaps someone will add it.

For my part, I'd rather prioritise model-generic solutions to controlling iterative models, which is what MLJIteration provides. That way we avoid a lot of duplication of effort.

fipelle (Author) commented Feb 10, 2023

@ablaom I think I figured out how to do it using the native APIs.

In the case of classification trees, this is easy enough. All you need to do is something along the lines of build_tree(labels, features, loss=(ns, n)->custom_loss(ns, n, args...)). In the case of the Gini impurity, ns is the vector of cases per class at the node being split and n is the total number of cases at that node. Of course, correct me if I am wrong.
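
To make that concrete, a rough, untested sketch (the helper name my_gini is made up, and the exact meaning of the (ns, n) arguments should be checked against util.entropy):

using DecisionTree

# Hypothetical custom loss with the (ns, n) signature discussed above:
# ns = per-class counts at the node being split, n = total count at that node.
my_gini(ns, n) = 1.0 - sum(abs2, ns ./ n)  # plain Gini impurity

labels   = rand(["a", "b", "c"], 200)  # toy data
features = rand(200, 4)

tree = build_tree(labels, features; loss=my_gini)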

In the case of a random forest, things are a little more complicated. Is there any way around writing a custom build_forest function with a user-defined loss instead of the default util.entropy(ns, n, entropy_terms)? Regarding changes to the bootstrap samples (when needed), they have to be implemented by modifying inds at

inds = rand(_rng, 1:t_samples, n_samples)

and

inds = rand(1:t_samples, n_samples)

I suppose.
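
For what it's worth, a rough sketch of the kind of replacement draw I have in mind (the helper below is hypothetical and would live inside a custom build_forest):

using Random

# Hypothetical drop-in for the uniform inds = rand(rng, 1:t_samples, n_samples)
# draw: sample with replacement within each class, so class proportions are
# (approximately) preserved in every bootstrap sample.
function stratified_bootstrap_inds(rng::AbstractRNG, labels::AbstractVector, n_samples::Integer)
    inds = Int[]
    for class in unique(labels)
        class_inds = findall(==(class), labels)
        n_class = round(Int, n_samples * length(class_inds) / length(labels))
        append!(inds, rand(rng, class_inds, n_class))
    end
    return inds
end

# e.g. inds = stratified_bootstrap_inds(Random.default_rng(), labels, n_samples)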

EDIT

I think it would be nice to extend DecisionTree, or to have a small package with more flexible versions of build_forest, to accommodate custom usage. What do you think? I am happy to create a new package if you'd prefer to keep things separate (a bit like StatsPlots.jl).

ablaom (Member) commented Feb 12, 2023

In the case of a random forest, things are a little more complicated. Is there any way around writing a custom build_forest function with a user-defined loss instead of the default util.entropy(ns, n, entropy_terms)?

Yes, I also recently discovered that the loss parameter is only exposed for single trees, and not forests. I'd definitely support fixing this and will open an issue.

fipelle (Author) commented Feb 12, 2023

@ablaom I have almost finished writing a custom implementation that also allows for custom bootstrapping (e.g., stratified sampling). Do you think it would be best to keep it separate, or would you accept a pull request for it as well?

ablaom (Member) commented Feb 13, 2023

Glad to hear about the progress. I think, to reduce the maintenance burden on this package, I'd prefer not to add model-generic functionality within the package itself. MLJ and other toolboxes already provide things like stratified resampling.

For example:

using MLJ

X, y = make_blobs(centers=5)  # synthetic data with 5 classes

Tree = @load DecisionTreeClassifier pkg=DecisionTree

tree = Tree()

julia> evaluate(
       tree,
       X,
       y,
       resampling=StratifiedCV(nfolds=5),
       measure=LogLoss(),
       )
PerformanceEvaluation object with these fields:
  measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows
Extract:
┌────────────────────────────────┬───────────┬─────────────┬─────────┬──────────────────────────────┐
│ measure                        │ operation │ measurement │ 1.96*SE │ per_fold                     │
├────────────────────────────────┼───────────┼─────────────┼─────────┼──────────────────────────────┤
│ LogLoss(                       │ predict   │ 5.05        │ 2.3     │ [5.41, 7.21, 1.8, 3.6, 7.21] │
│   tol = 2.220446049250313e-16) │           │             │         │                              │
└────────────────────────────────┴───────────┴─────────────┴─────────┴──────────────────────────────┘

I suggest that if MLJ is missing a feature you need, you open an issue there - and maybe even help provide it. The impact will be greater and the maintenance burden lower.

ablaom (Member) commented Feb 13, 2023

@fipelle When this PR merges, you will be able to (efficiently) control early stopping (and more) through the MLJ interface. A RandomForestClassifier example is given in the PR. Another example is this notebook.
