[ENH] Add "extratrees" for oblique trees #46 #75
Conversation
Codecov Report
Patch coverage:
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main      #75      +/-   ##
==========================================
+ Coverage   86.73%   86.99%   +0.25%
==========================================
  Files          26       26
  Lines        2111     2153      +42
==========================================
+ Hits         1831     1873      +42
  Misses        280      280
```

☔ View full report in Codecov by Sentry.
In Scikit-Learn, it's picking `rand_uniform` between (min feature val, max feature val). But why can we sample…
Thanks! Good first try. This doesn't work the way intended as of now:

- If you look at the design of the trees/splitter/criterion and the Python wrapper, the Python wrapper instantiates each class and then passes them to some Tree Builder. The tree builder calls the API of each class. In https://github.com/neurodata/scikit-learn/blob/a799dbe5feb1a0956c24b10ac1ea27f1eadfb76f/sklearn_fork/tree/_tree.pyx#L505, we see that it is called in line w/ the abstract API defined by `BaseSplitter.node_split`. The current implementation would never be able to be used.
- The only difference between extra trees and normal trees is that instead of sorting and choosing the best threshold at each split point, a random threshold is chosen. So: https://github.com/neurodata/scikit-learn/blob/a799dbe5feb1a0956c24b10ac1ea27f1eadfb76f/sklearn_fork/tree/_splitter.pyx#L714-L722 vs https://github.com/neurodata/scikit-learn/blob/a799dbe5feb1a0956c24b10ac1ea27f1eadfb76f/sklearn_fork/tree/_splitter.pyx#L409. Other LOC changes might be necessary due to the minor change in logic, i.e. we don't need to sort.
- Per the above point, we no longer need to evaluate all splits; a split point is just randomly chosen. https://github.com/neurodata/scikit-learn/blob/a799dbe5feb1a0956c24b10ac1ea27f1eadfb76f/sklearn_fork/tree/_splitter.pyx#L427-L462.

Moreover, the logic is slightly different in the oblique splitter implementation, so you're basically taking the oblique splitter implementation and changing the analogous logic. Lmk if this makes sense. If it helps when speccing out the high-level changes needed, you can also write an outline of a code plan, and I can lyk if it's in the right direction. A minimal sketch of the difference is below.

This requires a bit of reading the code, and perhaps writing down how things move through it, to fully understand the tree implementation.
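To illustrate the "random threshold instead of sort-and-scan" point in plain Python over a single (projected) feature column; `best_threshold` / `random_threshold` are illustrative names, not the Cython API:

```python
import numpy as np

rng = np.random.default_rng(0)

def best_threshold(feature_values):
    # Best splitter: sort the feature values, then enumerate every
    # candidate threshold (criterion evaluation per candidate omitted).
    sorted_vals = np.sort(feature_values)
    candidates = (sorted_vals[:-1] + sorted_vals[1:]) / 2.0
    # ... evaluate the criterion at each candidate and keep the best ...
    return candidates

def random_threshold(feature_values):
    # Random (extra-trees) splitter: one O(n) min/max pass, then a single
    # uniformly drawn threshold; no sorting, no scan over candidates.
    lo, hi = feature_values.min(), feature_values.max()
    return rng.uniform(lo, hi)
```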
At a high level, I was thinking of the following:

Am I totally off?
Yeah, more or less. And some FYI, in case it isn't clear in the code:

No, this is more or less in the right direction. The whole point of choosing a threshold at random is so you can i) avoid sorting and ii) avoid computing the criterion for each possible split threshold.

Well, more so that once we pick the threshold, yes, we have to evaluate it. But see the comment above. The main point of the randomness is to avoid the need to evaluate all particular choices of thresholds. This complements the randomness induced by: i) a random subset of the data being fed into each tree, and ii) random features being picked as split candidates at each node of each tree.
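For intuition, this stacking of randomness is the same idea behind scikit-learn's axis-aligned `ExtraTreesClassifier` vs `RandomForestClassifier`, which this PR mirrors for oblique trees. A quick sanity-check sketch using only scikit-learn (the oblique analogues would slot in the same way):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Despite the extra randomness in its thresholds, the extra-trees ensemble
# typically lands in the same accuracy ballpark as the random forest.
for Forest in (RandomForestClassifier, ExtraTreesClassifier):
    scores = cross_val_score(Forest(n_estimators=100, random_state=0), X, y, cv=5)
    print(f"{Forest.__name__}: {scores.mean():.3f}")
```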
Copying/pasting your comment from Slack here to leave comments (btw, feel free to just comment on GH; it is a lot easier to respond to):

> I created RandomObliqueSplitter that inherits from BestObliqueSplitter, which has the sample_proj_mat function. It overrides the node_split function with a 'random' version that picks current_split.threshold at random.

Scikit-learn's random splitter has a slightly different design, but the high-level logic is more or less similar. I would recommend reviewing this function and writing out the logic for yourself on paper.

> It ensures the min_weight_leaf and min_samples_leaf criteria are met, but besides that it does not sort or repeatedly evaluate the criterion, just once for the randomly chosen threshold. It builds, but many tests fail and I am clearly not understanding the root cause. Is there some specific consideration needed for testing the 'random' version, or should all the tests be applicable regardless?

Currently you are still sorting, so I don't think this is accurate.

I would take a look at the sklearn implementations of the best splitter and the random splitter and write down the logic to compare. Next, I would take a look at our oblique best splitter.
sktree/tree/_oblique_splitter.pyx (outdated)

```cython
# Sort the samples
sort(&feature_values[start], &samples[start], end - start)
```
See here you are still sorting.
```cython
# Draw a random threshold
current_split.threshold = rand_uniform(
    min_feature_value,
    max_feature_value,
    random_state,
)
```
In `node_split_random` in `Splitter`, it samples a random (uniform) threshold between `min_feature_value` and `max_feature_value` as above, then finds and sets `.pos` accordingly via `partitioner.partition_samples`. To find `min_feature_value` and `max_feature_value`, it uses `partitioner.find_min_max`, which iterates over each feature value and is basically the same as sorting. I was going to use that to get `min_feature_value` and `max_feature_value`, then follow the same logic as `node_split_random`. But I had a question: can we just randomly sample `current_split.pos` from `rand_int(start + min_samples_leaf, end, &self.rand_r_state)`? Is that a bad sampling strategy? Is either better than the other? If the latter is an acceptable sampling method, it'll be faster (and no sorting).
I'm not sure I understand the question. You wouldn't randomly sample `pos`. Also, finding the min/max is not the same as sorting: sorting is O(n log n), while finding the min/max is O(n).

The pseudocode for `RandomSplitter.node_split` is the following. The only difference is that `compute_features()` computes a projection of the data, rather than just choosing a column of the data. I think sklearn's random splitter does some extra stuff like keeping track of which columns contain constant values, whereas we don't do that.
```python
# Some checks and bookkeeping are missing, but this is the general gist,
# which is what's in sklearn's random splitting
for idx in range(max_features):
    # for each projection, compute the feature values over all samples
    # -> (n_samples,) vector
    current_feature_values = compute_features_over_samples()

    # find the min and max of the current feature values
    min_value, max_value = find_min_max(current_feature_values)

    # draw a random threshold
    current_split.threshold = rand_uniform(min_value, max_value, rand_state)

    # partition samples given the threshold by moving feature values
    # and samples in place
    current_split.pos = partition_samples()

    # evaluate the split
    criterion.reset()
    criterion.update(current_split.pos)
    current_proxy_improvement = self.criterion.proxy_impurity_improvement()
    if current_proxy_improvement > best_proxy_improvement:
        best_proxy_improvement = current_proxy_improvement
        best = current  # copy

# reorganize samples in place: samples[start:best.pos] + samples[best.pos:end]
# (copy of the previous code)
```
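To make the pseudocode concrete, here is a pure-Python toy of that loop; `proxy_improvement` below is a hypothetical stand-in for the Cython `criterion.proxy_impurity_improvement()`, and plain column selection stands in for `compute_features()`:

```python
import numpy as np

def proxy_improvement(y_left, y_right):
    # Stand-in criterion: negative weighted Gini, so larger is better.
    def gini(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p**2)
    n = len(y_left) + len(y_right)
    return -(len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n

def node_split_random(X, y, max_features, rng):
    best = (-np.inf, None, None)  # (improvement, feature, threshold)
    for f in rng.choice(X.shape[1], size=max_features, replace=False):
        values = X[:, f]
        lo, hi = values.min(), values.max()
        if lo == hi:
            continue  # constant feature, no valid split
        threshold = rng.uniform(lo, hi)  # the single random threshold
        mask = values <= threshold
        imp = proxy_improvement(y[mask], y[~mask])
        if imp > best[0]:
            best = (imp, f, threshold)
    return best

# usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
print(node_split_random(X, y, max_features=3, rng=rng))
```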
Possibly some good papers on this topic are the extra trees publication linked in the GH issue and the original random forest paper by Breiman.
Hi @SUKI-O, lmk when you feel this is ready for review. It might help if you write a short example using simulated data to show how this is faster than the normal oblique best splitter when building an oblique forest. It should also ideally be similar in performance. Some examples for inspiration:
Do you think I can/should use the dataset used for ObliqueRF, like this, to run the test/example? If not, any suggestions on what may work best for this application?
Sure, you can use that dataset. I would also have an example comparing the decision boundaries on the iris dataset.
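A sketch of what that iris example could look like, using scikit-learn's axis-aligned `DecisionTreeClassifier` / `ExtraTreeClassifier` as stand-ins (the PR's oblique classes would be dropped in the same way):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # keep two features so the boundary is plottable

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, Tree in zip(axes, (DecisionTreeClassifier, ExtraTreeClassifier)):
    clf = Tree(random_state=0).fit(X, y)
    DecisionBoundaryDisplay.from_estimator(clf, X, ax=ax, alpha=0.4)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k", s=20)
    ax.set_title(Tree.__name__)
plt.show()
```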
The Cython code looks more correct to me now, but make sure you remove any extraneous code that is not being used.

The examples are user-facing documentation that we want to hold to a pretty high standard, because each one should convey some "concept". In your case, that concept is comparing "extra trees" vs "normal trees" on some classification dataset: we want to see how much time is saved vs how much performance changes.

Holding these examples to a high standard also helps you: if you ever wonder how to run something, you can refer back to the example and have high-quality sample code that you previously wrote.
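A minimal sketch of that time-vs-performance comparison, again with scikit-learn's axis-aligned forests standing in for the oblique pair:

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for Forest in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Forest(n_estimators=200, random_state=0)
    start = perf_counter()
    clf.fit(X_train, y_train)
    fit_time = perf_counter() - start
    acc = clf.score(X_test, y_test)
    print(f"{Forest.__name__}: fit={fit_time:.2f}s, accuracy={acc:.3f}")
```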
I think I fixed most of the CI errors in 338428a. So feel free to just rebuild and run the exps, and if we see the improvement that we hypothesize, we can proceed.
Shows EORF is slower. I think this could be due to the really low sample size you force in the cnae-9, har, and lsvt datasets. Perhaps you can increase the sample size and run it again. You can build and run it locally to get something that makes sense.
> I've tweaked things around to get better results, but for wide datasets like cnae-9 and lsvt, EORF is always slower. Does this make sense to you?

Yes, these make sense. I would add a note explaining that you set the … I would read the extra-trees paper if that helps to communicate the key points: the bias-variance tradeoff and computational runtime. This is a good SO post on the topic too: https://stats.stackexchange.com/questions/175523/difference-between-random-forest-and-extremely-randomized-trees
Okay, from the exps, the general code seems fine!

There are a few issues tho to be addressed in order for this to get merged:

- Unused code: if code is not used, it should not be committed. If I'm misunderstanding and certain pieces of code are actually used, lmk.
- Examples: the examples can be improved: https://output.circle-artifacts.com/output/job/80903942-02eb-4fcd-a7ca-0d25ec870888/artifacts/0/dev/auto_examples/plot_extra_oblique_random_forest.html#sphx-glr-auto-examples-plot-extra-oblique-random-forest-py. The examples should clearly convey some concept; right now, they are reworded versions of other examples. An example should guide the user toward learning or understanding something.
  - The iris example should clearly convey the differences between normal oblique decision trees and extra oblique decision trees, i.e. the only difference is randomly chosen thresholds vs picking the best thresholds, and the trees are still useful.
  - The cc-18 example should clearly convey the tradeoff between choosing extra vs normal oblique decision trees.
```python
params = {
    "max_features": None,
    "n_estimators": 50,
    "max_depth": 5,
    "random_state": random_state,
    "n_cv": 10,
    "n_repeats": 1,
}
```
You should add a note here regarding `max_depth` and `n_estimators` being set low to allow the example to run on the CI. You can add some intuition on how these should be set normally.
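For instance, the note could live next to the dict as comments; the realistic-settings guidance below is my own suggestion, not something prescribed in the PR:

```python
random_state = 123  # any fixed seed, for reproducibility

params = {
    "max_features": None,
    "n_estimators": 50,  # kept small so the example runs quickly on CI;
                         # in practice hundreds of trees are typical
    "max_depth": 5,      # capped for CI runtime; often left as None so
                         # trees grow until leaves are (nearly) pure
    "random_state": random_state,
    "n_cv": 10,          # cross-validation folds used by the example
    "n_repeats": 1,
}
```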
I will merge this in for now, since this PR has the biggest diff and there are other PRs in the queue, but if you have questions or comments, feel free to open up a GH issue.
Thanks @SUKI-O!
Implement #46

Changes proposed in this pull request:

Before submitting
- See the "Before submitting" section of the CONTRIBUTING docs.
- See the "Writing docstrings" section of the CONTRIBUTING docs.

After submitting