Hello, thanks for the nice package.

I was working on an application where I wanted perfect prediction in a classification task and found that I was unable to achieve it with `partial_frac = 1.0`, which I did not expect. After some investigation, it appears that instances are sampled with replacement when constructing forests. As a result, although N samples are drawn for each individual tree fit, they almost always include duplicates and omit other rows. See e.g.: https://github.com/JuliaAI/DecisionTree.jl/blob/master/src/regression/main.jl#L104

I think it would be preferable if sampling were performed without replacement, ensuring that the `partial_frac = 1.0` limit is exact. I don't know if this is the standard convention for random forests, though.

I would be happy to contribute a PR if it's agreed that sampling without replacement is preferred.

Thank you!
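To make the effect concrete, here is a minimal sketch in plain Julia (not DecisionTree.jl internals): when N row indices are drawn with replacement from N rows, only about 63% of the rows are covered on average, which is why a tree fit with the sampling fraction set to 1.0 still misses training points.

```julia
using Random, Statistics

# Drawing N indices *with* replacement from 1:N covers only ~63% of the
# rows on average (the classic 1 - 1/e bootstrap coverage).
N = 1_000
rng = MersenneTwister(42)
unique_fracs = [length(unique(rand(rng, 1:N, N))) / N for _ in 1:100]
println(round(mean(unique_fracs), digits = 3))   # ≈ 0.632, i.e. 1 - 1/e
```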
From some review, it appears that sampling with replacement is standard and has theoretical justification, though in practice one might prefer either. Other libraries expose the choice of sampling with or without replacement as an argument; that would be a nice option here.
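A minimal sketch of what such an option could look like, assuming a hypothetical helper; the name `draw_row_indices` and its signature are illustrative only and not part of DecisionTree.jl's API:

```julia
using Random
using StatsBase: sample

# Hypothetical sketch: choose the per-tree row indices either with or
# without replacement, depending on an exposed keyword argument.
function draw_row_indices(rng::AbstractRNG, n_rows::Integer, frac::Real;
                          replace::Bool = true)
    n = round(Int, frac * n_rows)
    if replace
        return rand(rng, 1:n_rows, n)                     # bootstrap: duplicates likely
    else
        return sample(rng, 1:n_rows, n; replace = false)  # each row at most once
    end
end

# With replace = false and frac = 1.0, every row appears exactly once,
# so the forest can interpolate the training data.
idx = draw_row_indices(MersenneTwister(0), 10, 1.0; replace = false)
@assert sort(idx) == collect(1:10)
```

Defaulting to `replace = true` would preserve the current bootstrap behaviour while letting users opt into exact subsampling.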
mharradon changed the title from "RandomForests Cannot Overfit when partial_frac=1.0 due to sampling with repetition" to "RandomForests Cannot Overfit when partial_frac=1.0 due to Sampling with Replacement" on Oct 7, 2022.
ablaom changed the title from "RandomForests Cannot Overfit when partial_frac=1.0 due to Sampling with Replacement" to "Add option to resample features at nodes without replacement" on Oct 10, 2022.