Add option to resample features at nodes without replacement #192

Open
mharradon opened this issue Oct 6, 2022 · 1 comment

Comments

@mharradon

Hello, thanks for the nice package.

I was working on an application where I wanted perfect prediction in a classification task, and found that I could not achieve it with partial_frac = 1.0, which I did not expect. After some investigation, it appears that instances are sampled with replacement when constructing forests. As a result, although N samples are drawn for each individual tree fit, they almost always include duplicates and omit some of the original instances. See e.g.:

https://github.com/JuliaAI/DecisionTree.jl/blob/master/src/regression/main.jl#L104

julia> rand(1:5, 5)
5-element Vector{Int64}:
 5
 5
 2
 2
 3

I think it would be preferable if sampling were performed without replacement, so that with partial_frac = 1.0 each tree is fit on every instance exactly once. I don't know whether this is the standard convention for random forests, though.
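To make the difference concrete, here is a minimal sketch using only Julia's `Random` standard library. It is an illustration, not DecisionTree.jl code; the seed and `n` are arbitrary choices. When N indices are drawn with replacement from 1:N, the expected fraction of distinct indices is 1 - (1 - 1/N)^N, which approaches 1 - 1/e (about 0.632) for large N, whereas a random permutation covers every index exactly once.

```julia
using Random

# Illustrative values only; not taken from DecisionTree.jl.
rng = MersenneTwister(42)
n = 1_000

# With replacement: the same pattern as rand(1:5, 5) above, so duplicates are likely.
with_replacement = rand(rng, 1:n, n)
frac_unique = length(unique(with_replacement)) / n

# Without replacement: a random permutation visits each index exactly once.
without_replacement = randperm(rng, n)

println("distinct fraction, with replacement:    ", frac_unique)
println("distinct fraction, without replacement: ", length(unique(without_replacement)) / n)
```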

I would be happy to contribute a PR if it's agreed that non-repeated sampling is preferred.

Thank you!

@mharradon

After some reading, I see that sampling with replacement is standard and has theoretical justification, though I think in practice one might prefer either. Other libraries expose the choice of sampling with replacement as an argument; that would be a nice option here.
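As a rough sketch of what such an option could look like, the helper below picks row indices for one tree. The function name `sample_indices`, the `frac` argument, and the `replace` keyword are all hypothetical names chosen for illustration; none of them is part of the DecisionTree.jl API.

```julia
using Random

# Hypothetical helper: choose the row indices used to fit a single tree.
# `frac` plays the role of the sampling fraction discussed above.
function sample_indices(rng::AbstractRNG, n_samples::Integer, frac::Real; replace::Bool=true)
    0 <= frac <= 1 || throw(ArgumentError("frac must lie in [0, 1]"))
    k = round(Int, frac * n_samples)
    if replace
        # Bootstrap-style draw: indices may repeat and some may be missing.
        return rand(rng, 1:n_samples, k)
    else
        # First k entries of a random permutation: no index appears twice.
        return randperm(rng, n_samples)[1:k]
    end
end

rng = MersenneTwister(1)
println(sample_indices(rng, 5, 1.0; replace=true))
println(sample_indices(rng, 5, 1.0; replace=false))
```

Defaulting `replace` to `true` in this sketch keeps the conventional bootstrap behavior, so callers would have to opt in to sampling without replacement.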

@mharradon changed the title from "RandomForests Cannot Overfit when partial_frac=1.0 due to sampling with repetition" to "RandomForests Cannot Overfit when partial_frac=1.0 due to Sampling with Replacement" on Oct 7, 2022
@ablaom changed the title from "RandomForests Cannot Overfit when partial_frac=1.0 due to Sampling with Replacement" to "Add option to resample features at nodes without replacement" on Oct 10, 2022