Hello, thanks for the nice package.

I was working on an application where I wanted perfect prediction in a classification task and found that I was unable to achieve it with `partial_frac = 1.0`, which I did not expect. After some investigation, it appears that instances are sampled with replacement when constructing forests. As a result, although N samples are drawn for each individual tree fit, they almost always include duplicates and omit other rows. See e.g.: https://github.com/JuliaAI/DecisionTree.jl/blob/master/src/regression/main.jl#L104

I think it would be preferable if sampling were performed without replacement, ensuring that the `partial_frac = 1.0` limit is exact. I don't know if this is the standard convention for random forests, though.

I would be happy to contribute a PR if it's agreed that sampling without replacement is preferred.

Thank you!
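To make the effect concrete, here is a minimal sketch in plain Julia (not DecisionTree.jl internals): when N row indices are drawn with replacement from N rows, only about 63% of the rows are covered on average, which is why a tree fit with the sampling fraction set to 1.0 still misses training points.

```julia
using Random, Statistics

# Drawing N indices *with* replacement from 1:N covers only ~63% of the
# rows on average (the classic 1 - 1/e bootstrap coverage).
N = 1_000
rng = MersenneTwister(42)
unique_fracs = [length(unique(rand(rng, 1:N, N))) / N for _ in 1:100]
println(round(mean(unique_fracs), digits = 3))   # ≈ 0.632, i.e. 1 - 1/e
```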
From some review, it appears that sampling with replacement is standard and has theoretical justification, though in practice one might prefer either. Other libraries expose the choice of sampling with or without replacement as an argument; that would be a nice option here.
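A minimal sketch of what such an option could look like, assuming a hypothetical helper; the name `draw_row_indices` and its signature are illustrative only and not part of DecisionTree.jl's API:

```julia
using Random
using StatsBase: sample

# Hypothetical sketch: choose the per-tree row indices either with or
# without replacement, depending on an exposed keyword argument.
function draw_row_indices(rng::AbstractRNG, n_rows::Integer, frac::Real;
                          replace::Bool = true)
    n = round(Int, frac * n_rows)
    if replace
        return rand(rng, 1:n_rows, n)                     # bootstrap: duplicates likely
    else
        return sample(rng, 1:n_rows, n; replace = false)  # each row at most once
    end
end

# With replace = false and frac = 1.0, every row appears exactly once,
# so the forest can interpolate the training data.
idx = draw_row_indices(MersenneTwister(0), 10, 1.0; replace = false)
@assert sort(idx) == collect(1:10)
```

Defaulting to `replace = true` would preserve the current bootstrap behaviour while letting users opt into exact subsampling.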
mharradon changed the title from "RandomForests Cannot Overfit when partial_frac=1.0 due to sampling with repetition" to "RandomForests Cannot Overfit when partial_frac=1.0 due to Sampling with Replacement" on Oct 7, 2022.
ablaom changed the title from "RandomForests Cannot Overfit when partial_frac=1.0 due to Sampling with Replacement" to "Add option to resample features at nodes without replacement" on Oct 10, 2022.